REMARKS 

Claims 19-32 are pending. Claim 1 has been cancelled and new claims 19-32 drafted for 
ease of examination. Support for new claims 19-32 derives from the specification and claims as 
originally filed. For example, support for computational methods for the generation of primary 
libraries is described at page 7, line 22, through page 8, line 12, and page 10, line 9, through
page 15, line 14; methods for the generation of secondary libraries from primary libraries are 
described at page 26, line 27, through page 30, line 27; methods for the generation of tertiary 
libraries are described at page 34, line 18, through page 19; page 40, line 14; support for the 
synthesis of variant proteins, beginning with the corresponding oligonucleotide sequences using 
multiple PCR can be found at pages 31-32; methods for isolating, purifying, and expressing the 
oligonucleotide sequences as proteins are well known in the art, and are described at pages 41-47
and in the Examples; and, the use of a computer workstation comprising a microprocessor is 
described at page 64, lines 1-6. Accordingly, the amendments do not present new matter and 
entry is proper. 

Applicants thank the Examiner for withdrawing the rejection under 35 USC §112, second 
paragraph and the Double Patenting rejection. 

Rejections under 35 U.S.C. § 101 

Claim 1 is rejected under 35 U.S.C. § 101 for lacking a specific asserted utility or a well 
established utility. In rejecting claim 1, the Examiner's position appears to be that the claimed 
method generates a secondary library of undefined structure that is so general as to lack a
real-world utility. The rejection is moot as applied to cancelled claim 1. Applicants respectfully
submit that this rejection does not apply to newly added claims 19-32 for the following reasons. 

Newly added claims 19-32 disclose the following inventions: (1) computational methods 
for generating a secondary library of protein variants (independent claim 19 and dependent 
claims 20-21) and methods for generating a tertiary library of protein variants (independent 
claim 22 and dependent claims 23-29); and, (2) an application of a computer program product 
(independent claim 30 and dependent claims 31-32).

The Examiner's basic position appears to be that there is no "specific and substantial
utility", citing In re Kirk. As a preliminary matter, the Applicants first note that In re Kirk is a
case dated prior to the new Utility Guidelines, and secondly that Kirk is directed to compositions
of a new chemical class with a sole utility of "useful biological properties".

As to the first point, the Applicants respectfully draw the Examiner's attention to the
Utility Guidelines:

In most cases, an applicant's assertion of utility creates a presumption of utility that will
be sufficient to satisfy the utility requirement of 35 U.S.C. § 101. As the CCPA stated in In re
Langer:

As a matter of Patent Office practice, a specification which 
contains a disclosure of utility which corresponds in scope to the 
subject matter sought to be patented must be taken as sufficient to 
satisfy the utility requirement of § 101 for the entire claimed
subject matter unless there is a reason for one skilled in the art to 
question the objective truth of the statement of utility or its scope. 

Thus, Langer and subsequent cases direct the Patent Office to presume that a statement of 
utility made by an applicant is true. For obvious reasons of efficiency and in deference to an 
applicant's understanding of his or her invention, when a statement of utility is evaluated, Patent 
Office personnel should not begin an inquiry by questioning the truth of the statement of utility. 
Instead, any inquiry must start by asking if there is any reason to question the truth of the 
statement of utility. This can be done by evaluating the logic of the statements made, taking into 
consideration any evidence cited by the applicant. If the asserted utility is credible (i.e., 
believable based on the record or the nature of the invention), a rejection based on "lack of 
utility" is not appropriate. Thus, Patent Office personnel should not begin an evaluation of utility 
by assuming that an asserted utility is likely to be false, based on the technical field of the 
invention or for other general reasons. 

Compliance with § 101 is a question of fact. Thus, to overcome the presumption of truth
that an assertion of utility by the applicant enjoys, Patent Office personnel must establish that it
is more likely than not that one of ordinary skill in the art would doubt (i.e., "question") the truth
of the statement of utility. To do this, Patent Office personnel must provide evidence sufficient to
show that a person of ordinary skill in the art would consider the statement of asserted utility
"false". A person of ordinary skill must have the benefit of both facts and reasoning in order to
assess the truth of a statement. This means that if the applicant has presented facts that support
the reasoning used in asserting a utility, Patent Office personnel must present countervailing
facts and reasoning sufficient to establish that a person of ordinary skill would not believe the
applicant's assertion of utility (MPEP § 2107.02 III A). The initial evidentiary standard used during
evaluation of this question is a preponderance of the evidence (i.e., the totality of facts and
reasoning suggest that it is more likely than not that the statement of the applicant is false). It is
respectfully submitted that the Examiner has not met this burden.

The claims are directed to specific methods of computationally generating libraries of 
proteins. As has been argued previously, these methods have a "real world" utility, as evidenced 
in several ways. First of all, a "real world" use has been shown because methods of protein
design related to those of the present invention have been shown to work as claimed. See also
U.S. Patent Nos. 6,188,965; 6,296,312; 6,403,312; 6,708,120; 6,792,356; PCT/US98/07254 and
PCT/US01/40091. Such methods have been used to generate novel proteins with enhanced
properties; see, for example, U.S. Patent Nos. 6,682,923; 6,627,186; 6,514,729; and 6,746,853.
See also Steed et al., Science (2003), 301:1895-1898, a copy of which is enclosed as Exhibit A;
Hayes et al., PNAS, 99 (25): 15926-15931, a copy of which is enclosed as Exhibit B; and Luo et
al., Protein Science (2002), 11: 1218-1226, a copy of which is enclosed as Exhibit C. Applicant
also notes that the methodology described in these patents and scientific publications is not 
limited to enzymes, but applies to therapeutic proteins as well as any other type of proteins. 

In further support of utility, these methods are recognized by those of skill in the art as
useful techniques. A number of third parties have recognized the value of these methods. For
example, in the article "Proteins from Scratch" (DeGrado, Science (1997), 278:80-81, a copy of
which is enclosed as Exhibit D), biochemistry professor William F. DeGrado of the University of
Pennsylvania School of Medicine, a world-renowned expert in protein structure, folding and
design, comments on the computational platform designed by Dahiyat and Mayo in Science (1997),
278:82-87. This platform is an earlier version of the computational platform that has evolved and
is claimed herein. Dr. DeGrado states:

Not long ago, it seemed inconceivable that proteins could be
designed from scratch. Because each protein sequence has an
astronomical number of potential conformations, it appears that
only an experimentalist with the evolutionary life span of Mother
Nature could design a sequence capable of folding into a single,
well-defined three dimensional structure. But now on page 82 of
this issue, Dahiyat and Mayo describe a new approach that makes
de novo protein design as easy as running a computer.

Dr. DeGrado further states (col 1, paragraph 3):

Thus, the problem of de novo protein design reduced to two steps:
selecting a desired tertiary structure and finding a sequence that
would stabilize this fold. Dahiyat and Mayo have now mastered
the second step with spectacular success. They have distilled the
rules, insights and paradigms gleaned from two decades of
experiments into a single computational algorithm. . . . Thus the rules
of . . . computational methods for de novo design may now be
sufficiently defined to allow the engineering of a variety of
proteins.

As can be seen from the selections of Dr. DeGrado's article cited above, Dr. DeGrado is
commenting on the usefulness of the general method. Applicants therefore respectfully
dispute the Examiner's statement that "the article by DeGrado relates to the specific compound,
Zinc finger protein". Dr. DeGrado is specifically discussing the computational design of Mayo
and Dahiyat, not just a zinc finger protein.

Furthermore, Applicants respectfully submit that the Examiner has misunderstood
DeGrado by his reference to the quotation at page 80 citing "de novo design is best approached
by simultaneously considering all of the side chains in the protein - unfortunately, a very high
order combinatorial problem". It is this very paragraph that goes on to discuss Dahiyat and
Mayo's DEE theorem to "efficiently search through sequence and side chain rotamer space" (see
column 3, page 80, last sentence of second full paragraph). Thus the DeGrado article articulates
that the Dahiyat and Mayo solution, which forms the basis of the present claims, is in fact very
useful in the field of combinatorial evaluation.

Further, in 2002, Dr. Jeffery G. Saven, a well-known expert in protein design, published a
review of the state of the art in combinatorial protein libraries (see Saven, JG, Curr. Op. Struct.
Biol. (2002), 12:453-458, a copy of which is enclosed as Exhibit E), where he states at page 456,
col. 1, 3rd paragraph, lines 6-13:

Not only can combinatorial methods be used for discovery but
also, more deeply, they can inform our understanding of protein
properties by generating and assaying whole ensembles of
sequences. Traditionally, advances in structural biology have
come from examining the structures of naturally occurring
proteins, but with combinatorial experiments, an enormous
diversity of sequences can be generated at the control of the
researcher.

Saven also states that 

Thus, methods for winnowing and focusing sequence space are a 
vital component of combinatorial protein design (see page 453,
column 1, first paragraph) . . . Combinatorial methods are powerful 
tools for cases in which we have an incomplete understanding of 
molecular properties. 

The Saven publication, while not prior art in the instant application, shows that it is
known in the art that combinatorial library generation has "real world use". Thus, the
examples of actual utility provided by Applicant and discussed above, together with the
recognition by those skilled in the art of protein design and combinatorial library generation,
meet the utility requirement under 35 USC § 101.

With respect to the follow-on statement by the Examiner, "or human growth hormone by
Filikov", it is respectfully submitted that the human growth hormone was designed using the
protein design automation computer program described in the recited claims. The improved
proteins described in the other cited publications and patents were likewise designed using the
protein design automation program defined in the claims. These are examples of the breadth of the
program. The uses are not limited to enzymes, but extend to any protein. In addition, the methods
used to identify such improved proteins are used in this instant case.

The Examiner cites In re Kirk for the proposition that "We do not believe that it was the
intention of the statutes to require the Patent Office, the courts or the public to play the sort of 
guessing game that might be involved if an applicant could satisfy the requirements of the 
statutes by indicating the usefulness of a claimed compound in terms of possible use so general 
as to be meaningless..." (emphasis by Examiner). 

Applicants initially point out that the Kirk case relates to a compound, not a method.
Further, the methods recited in the present claims recite specifically defined steps that are
understandable to those skilled in the art of computational biology and chemistry. The elements
of the claims, including the use of scoring functions and probability distribution tables (claim
19), using PDA®, synthesizing and screening library sequences (claim 22) and using PDA® as
well as a probability distribution table (claim 30) support a finding of utility as defined in 35
U.S.C. § 101.

In addition, the Applicants respectfully draw the Examiner's attention to the requirements
as further outlined in the Guidelines:

Where an applicant has specifically asserted that an invention has a
particular utility, that assertion cannot simply be dismissed by
Office personnel as being "wrong," even when there may be reason
to believe that the assertion is not entirely accurate. Rather, Office
personnel must determine if the assertion of utility is credible (i.e.,
whether the assertion of utility is believable to a person of ordinary
skill in the art based on the totality of evidence and reasoning
provided). An assertion is credible unless (a) the logic underlying
the assertion is seriously flawed, or (b) the facts upon which the
assertion is based are inconsistent with the logic underlying the
assertion. Credibility as used in this context refers to the reliability
of the statement based on the logic and facts that are offered by the
applicant to support the assertion of utility.

* * * 

... a prima facie showing [of no specific and substantial credible 
utility] must establish that it is more likely than not that a person of 
ordinary skill in the art would not consider that any utility asserted 
by the applicant would be specific and substantial. 

Thus, the burden is shifted to the Examiner. The Applicants respectfully submit that this 
burden has not been met, and the rejection should be withdrawn. 

The arguments made above with respect to 35 USC §101 are equally applicable to the 
rejection under 35 USC §112, first paragraph. The techniques described in the recited methods 
have a specific and well-established utility, and one skilled in the art would know how to use the 
claimed invention, particularly as demonstrated in the patents and scientific articles discussed 
above. 

Lack of Utility under § 112, 1st Paragraph

As argued above, there is sufficient utility under both 35 U.S.C. §§ 101 and 112 to meet
the statutory requirements, and this rejection should be withdrawn. 

Rejections under 35 U.S.C. § 112, first paragraph 

Claim 1 is rejected under 35 U.S.C. § 112, first paragraph for failing to comply with the
written description requirement. In rejecting claim 1, the Examiner's position appears to be that
the specification is enabling only for the design of enzymes (see page 5 of the final office
action). The rejection is moot as applied to cancelled claim 1. Applicants respectfully submit
that this rejection does not apply to newly added claims 19-32 for the following reasons.

The Applicants acknowledge the Examiner's statement that the specification is enabling 
for methods utilizing enzymes, but respectfully disagree that other types of proteins are not 
enabled. 

As stated supra regarding utility, Applicant respectfully submits that the application is
enabled by the examples, in which the coordinates of a molecule were input into a computer and
heavy side chain atoms were selected within a 4 Angstrom sphere around four catalytic residues.
These heavy side chain atoms defined the variable residue positions for which a primary library
was calculated. A probability table (Table 3) was calculated from the top 1000 sequences in the
list. Table 3 shows the number of occurrences of each of the amino acids selected for each
position (i.e., 5 variable positions and 25 floated positions). One skilled in the art would
readily be capable of extrapolating these examples to a variety of protein systems with a variety
of functions, particularly when read in light of the specification (e.g., see Specification
page 7, line 27 to page 9, line 5; page 34, line 22 to page 35, line 12). Thus these examples also
show enablement.
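
By way of illustration only, the kind of tabulation summarized in Table 3 can be sketched as
follows. The sketch assumes a ranked list of equal-length variant sequences; the function name
occurrence_table, the toy sequences, and the use of Python are assumptions made for this
illustration and are not taken from the specification.

```python
from collections import Counter


def occurrence_table(ranked_sequences, top_n=1000):
    """Count how often each amino acid occurs at each position among the
    top_n sequences of a ranked list (hypothetical helper, illustration only)."""
    top = ranked_sequences[:top_n]
    if not top:
        return []
    length = len(top[0])
    if any(len(seq) != length for seq in top):
        raise ValueError("all sequences must have the same length")
    # zip(*top) walks the sequences column by column; one Counter per position.
    return [Counter(column) for column in zip(*top)]


# Toy ranked list, best first. Real input would come from the design calculation.
ranked = ["ALK", "ALR", "GLK", "AVK"]
table = occurrence_table(ranked, top_n=3)
for position, counts in enumerate(table, start=1):
    print(position, dict(counts))
```

Each entry of the resulting table gives, for one position, the number of occurrences of each
amino acid among the retained sequences, which is the form of information Table 3 is described
as presenting.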

With respect to the scope of the enabling disclosure not being commensurate with the scope
provided in the Specification, there is disclosure of using a computational design program, and
preferably PDA® technology, as embodiments of the invention. See Specification at page 2, lines
1-3; page 7, lines 9-12; and page 14, line 30 to page 15, line 5. In addition, the examples provide
further enabling disclosure to one skilled in the art to practice this invention. As stated
previously, the methodology is not limited to a particular kind of protein, and one skilled in the
art would not be led to believe that this method is limited to enzymes. The method of the present
invention is not limited to enzymes, since the modifications may be done to any proteins, not just
enzymes. The methodology has been successfully employed in many non-enzyme proteins, e.g.,
TNF, GCSF, Interferon, etc. The publications cited in the section addressing 35 USC §101
show the diversity of proteins that may be used. In addition, the article by Dr. Saven shows that
those skilled in the art do not limit proteins by type (such as enzymes). The methodologies apply
to any type of protein. The methodology requires that coordinates of a target protein be input.
There is nothing in the methodology that limits it to enzymes, and while the examples
show enzyme modifications, these examples are just that, examples of how the technology
works. The specification provides support for the use of any protein that may be used in this
method. One skilled in the art would understand that this method may be used on any protein
and is not limited to enzymes.

The Examiner cites the DeGrado reference at page 80 (see Office Action, page 6, first
full paragraph). Applicants make two points. First, the Applicants are not designing a
protein de novo, which is the subject of the DeGrado quote, but are inputting the coordinates of a
target protein. Inputting the coordinates of a target protein is equivalent to enabling the
analysis of that particular protein structure. The methodology employs known physico-chemical
parameters of proteins, amino acids and rotamers to modify the target protein. Secondly,
DeGrado also actually discusses the fact that this "very high combinatorial problem" is
addressed by the Dahiyat and Mayo technique. Thus, DeGrado also supports a finding of
enablement of the present techniques.

Thus for every protein (not just enzymes), the same methodology as recited in the instant 
claims is used. 

There is no undue experimentation since the specification enables one skilled in the art to 
practice the invention using the specifically recited steps in the claims. The Examiner refers to 
Cys, Pro and Gly not being used in an Example in the specification. Applicants respectfully
refer the Examiner to page 17, lines 30-35, where the specification discloses the basis behind
using, or not using, certain amino acids in certain situations. To one skilled in the art of protein
design, this is not undue experimentation but a design choice. With respect to the Examiner's
comments regarding SO2 and water being removed, Applicants respectfully refer the Examiner
to page 15, lines 6-25 for the discussion on backbone structure preparation, as well as the 
discussion on backbone preparation above. 

Applicants respectfully point to In re Goffe, 191 USPQ 429 (CCPA 1976), where the
court stated:

For all practical purposes, the Board would limit Appellant to
claims involving the specific materials disclosed in the examples,
so that a competitor seeking to avoid infringing the claims would
merely have to follow the disclosure in the subsequently issued
patent to find a substitute. However, to provide effective
incentives, claims must adequately protect inventors. To demand
that the first to disclose shall limit his claims to what he has found
to work or to materials which meet the guidelines specified for
"preferred" materials in a process such as the one herein involved
would not serve the constitutional purpose of promoting progress
in the useful arts.

Additionally, in In re Angstadt, 190 USPQ 214, 218 (CCPA 1976), the court further stated:

Appellants have apparently not disclosed every catalyst which will 
work; they have apparently not disclosed every catalyst which will 
not work. The question, then, is whether in an unpredictable art, 
section 112 requires disclosure of a test with every species covered 
by a claim. To require such a complete disclosure would 
apparently necessitate a patent application or applications with 
"thousands" of examples or the disclosure of "thousands" of 
catalysts along with information as to whether each exhibits 
catalytic behavior resulting in the production of hydroperoxides. 
More importantly, such a requirement would force an inventor 
seeking adequate patent protection to carry out a prohibitive 
number of actual experiments. This would tend to discourage 
inventors from filing patent applications in an unpredictable area 
since the patent claims would have to be limited to those 
embodiments which are expressly disclosed. 

Therefore, in conclusion, Applicants submit that the Specification taken in conjunction
with the state of the art at the time the invention was filed fully enables a person skilled in the art 
to practice the method of the invention without undue experimentation. Applicants respectfully 
request reconsideration and withdrawal of the rejection. 

Applicants respectfully submit that the specification enables a method for 
computationally generating a genus of secondary libraries comprising variant sequences in which 
the starting protein structure (i.e. target protein or scaffold protein) can be any protein for which 
a three dimensional structure is known or can be generated. In addressing the written description 
requirement under 35 U.S.C. § 112, the Federal Circuit in University of California v. Eli Lilly
and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997), stated:

A description of a genus of cDNAs may be achieved by means of a
recitation of a representative number of cDNAs, defined by
nucleotide sequence, falling within the scope of the genus or of a
recitation of structural features common to the members of the
genus, which features constitute a substantial portion of the genus.
This is analogous to enablement of a genus under Section 112,
para. 1, by showing the enablement of a representative number of
species within the genus. See Angstadt, 537 F.2d at 502-03
(deciding that applicants "are not required to disclose every species
encompassed by their claims even in an unpredictable art" and that
the disclosure of forty working examples sufficiently
described subject matter of claims directed to a generic process) . . .
See also In re Grimme, 274 F.2d 949, 952 ("[I]t has been
consistently held that the naming of one member of such a group is
not, in itself, a proper basis for a claim to the entire group.
However, it may not be necessary to enumerate a plurality of
species if a genus is sufficiently identified in an application by
other appropriate language.").

In support of the position that Applicants have designed many proteins that are not
"enzymes", Applicants enclose herewith a number of publications that are both prior and
subsequent to the filing date of the present application. These are not offered to augment the
disclosure of the application; rather, the work is presented to show that the present invention is
enabled for any protein for which a defined set of coordinates can be generated. See In re
Wilson, 135 USPQ 442, 444 (CCPA 1962); Ex parte Obukowicz, 27 USPQ 2d 1063 (BPAI
1993); Gould v. Quigg, 3 USPQ 2d 1302, 1305 (Fed. Cir. 1987):

"it is true that a later dated publication cannot supplement an 
insufficient disclosure in a prior dated application to render it 
enabling. In this case the later dated publication was not offered as 
evidence for this purpose. Rather, it was offered ... as evidence 
that the disclosed device would have been operative" printed 
publications. 

For example, the enclosed articles describe computationally designed GCSF (US
6627186 and Luo P et al., Protein Science 11, 1218-1226 (2002); enclosed herein as Exhibits A
and B), Interferon Beta (US 6514729, enclosed herein as Exhibit C) and TNF-alpha (US
publication No. 2003/138401 and Steed PM et al., Science 301, 1895-1898 (2003); enclosed
herein as Exhibits D and E). These non-enzymatic proteins have a variety of
structures and have all been successfully designed. Thus, it is improper to limit the scope of this
invention to just "enzymes".

Applicants respectfully point out that new claim 22 specifically recites PDA® in
response to the Examiner's statement at page 6, first full paragraph of the Office Action. PDA® is
a preferred embodiment of Applicants' invention, but this particular computational approach is
not necessarily required. Applicants also respectfully dispute that the recited method claims are
"the shot in the dark, genetic approach" (see Office Action, page 6, first full paragraph). The
approach is rational, not random ("shot in the dark"). Applicants again reiterate that one skilled
in the art would be enabled to practice the steps of the method without undue experimentation.

The articles, patents and patent applications discussed above support the enablement of
the methods disclosed in the pending claims. Importantly, the methods apply to proteins in 
general, regardless of whether the protein is an enzyme, as described in the example, or an 
antibody, cell surface receptor, or other protein of interest. 

Accordingly, Applicants respectfully submit that the specification fully enables the 
present claims, and respectfully request withdrawal of the rejection under 35 U.S.C. § 112, first
paragraph. 

Rejections under 35 U.S.C. § 102 

Claim 1 is rejected under 35 U.S.C. § 102(b) as being anticipated by Fechteler et al.,
1995, J. Mol. Biol., 13:114-31. In rejecting claim 1, the Examiner's position appears to be that,
at page 128, designing a protein model is the same concept as generating a second variant
library, and thus, Fechteler describes the same computational method as taught in the instant
application. The rejection is moot as applied to cancelled claim 1. Applicants respectfully
submit that this rejection does not apply to newly added claims 19-32 for the following reasons.

To anticipate a claim under 35 U.S.C. § 102(b), a reference must teach every element of 
the rejected claim (MPEP § 2131). 

Applicants respectfully submit that Fechteler teaches a method for predicting the
three-dimensional structure in insertion/deletion regions of a protein structure that combines
cluster analysis with geometric scoring criteria. Fechteler uses clustering with geometric criteria
to narrow the list of fragment options when attempting to fit structural fragments onto the existing
template structure. The modeling in Fechteler always takes place for a single unique protein
sequence (i.e., although structural variants are created, no sequence variants are created).

Applicants also respectfully reiterate that Fechteler does not teach synthesis of proteins of
the invention - the methods sections on pages 128-129 of the Fechteler reference merely spell 
out the details of the Fechteler structure prediction method, and the only entity that can be 
produced by the method of Fechteler is a theoretical list of 3-D coordinates for placement of 
atoms in space. Indeed, as Fechteler was completely focused on predicting the structure of a 
protein based on its sequence, the method was only applied to sequences that had already been 
synthesized and characterized - that is, the order of application is reversed relative to the 
Applicants' method in which variant sequence libraries are designed and then produced. 

Further, the Fechteler reference does not create novel variant sequences. All of the 
sequences identified are the same as in the initial set. 

In contrast to Fechteler, claims 19-32 use physico-chemical scoring functions (e.g., van
der Waals, hydrogen bonding, etc.), probability tables and protein design automation to
computationally filter variant protein sequences and generate a primary list of variant proteins.
The current invention then further generates a secondary library of variant protein sequences by
combining a plurality of variant amino acid residues. There is also no discussion or teaching in
Fechteler of combining a plurality of their database fragments. Fechteler does not teach or
suggest the use of scoring functions, probability tables, protein design automation, or the design
of variant protein sequences and libraries.
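
The combining step can be sketched as follows, for illustration only. Given, for each variable
position, the amino acids retained from a probability table, a secondary library is enumerated
as every combination of those residues. The frequency cutoff rule, the function names
residues_above_cutoff and secondary_library, and the toy counts are assumptions made for this
sketch; it is not the claimed implementation and does not reproduce PDA®.

```python
from itertools import product


def residues_above_cutoff(position_counts, cutoff):
    """Keep residues whose share of occurrences at one position is >= cutoff.
    (Hypothetical selection rule, assumed for illustration.)"""
    total = sum(position_counts.values())
    return sorted(aa for aa, n in position_counts.items() if n / total >= cutoff)


def secondary_library(table, cutoff=0.1):
    """Enumerate every sequence formed by combining the retained residues,
    taking one residue per position."""
    choices = [residues_above_cutoff(counts, cutoff) for counts in table]
    return ["".join(combo) for combo in product(*choices)]


# Toy per-position occurrence counts (assumed values, not from Table 3).
toy_table = [
    {"A": 700, "G": 290, "S": 10},
    {"L": 1000},
    {"K": 600, "R": 400},
]
library = secondary_library(toy_table, cutoff=0.1)
print(library)
```

In this toy case the rarely observed residue at the first position falls below the cutoff, so
the library contains only combinations of the residues that remain. Sequences can appear in the
library that were absent from the input list, which is the respect in which the combining step
differs from a method that only re-ranks existing sequences.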

Hence, Fechteler does not anticipate the claimed subject matter. Withdrawal of the 
rejection under 35 U.S.C. § 102(b) is requested. 

The Examiner is invited to contact the undersigned at (415) 781-1989 if any issues may 
be resolved in that manner. 

Respectfully submitted,
DORSEY & WHITNEY LLP
Suite 3400

San Francisco, California 94111-4187
Telephone: (415) 781-1989
Fax No. (415) 398-3249

By:

Robin M. Silva, Reg. No. 38,304
Filed under 37 C.F.R. § 1.34(a)

EXHIBIT A

Inactivation of TNF Signaling by Rationally Designed
Dominant-Negative TNF Variants
Steed, Tansey, et al. [remainder of author list illegible]

[illegible] and has been implicated in many pathological conditions. We used
structure-based design to engineer variant TNF proteins that rapidly form
heterotrimers with [illegible] stimulate signaling through TNF receptors. Thus, TNF is
inactivated by [illegible] represent a possible approach to anti-inflammatory [illegible]
that the strategy can attenuate TNF-mediated pathology. Similar rational
design could be used to engineer inhibitors of additional TNF superfamily
cytokines as well as other multimeric ligands.

TNF is a proinflammatory cytokine that can complex two TNF receptors, TNFR1 (p55) and
TNFR2 (p75), to activate signaling cascades controlling apoptosis, inflammation, cell
proliferation, and the immune response (1-5). The 26-kD type II transmembrane TNF precursor
protein, expressed on many cell types, is proteolytically converted into a soluble 52-kD
homotrimer (6). An elevated serum level of TNF is associated with the pathophysiology of
rheumatoid arthritis (RA), inflammatory bowel disease, and ankylosing spondylitis (1, 7, 8), and
molecules that inhibit TNF signaling have demonstrated clinical efficacy in treating some of
these diseases (9, 10).

We have engineered dominant-negative 
TNF pN-TNF) variants that inactivate Che 
native homotrimcr by a sequestration mech- 
anism that blocks TNF bloactivity (fig. SI). 
Protein design automation (PDA)» an in silico 
method that predicts protem variants with 
improved biological properties (/7-7i), was 
used to introduce smgle or double ammo acid 
changes into TNF (Fig. IA) to generate the 
desired, biological profile while maintaining 
the overall structural integrity of the mole- 
cule. Specifically, our goal was to design 

91016, USA. 

*These authors contributed equally to this work. 
†Present address: Department of Physiology, University 
of Texas Southwestern Medical Center at Dallas, 
Dallas, TX 75390, USA. 
‡To whom correspondence should be addressed. 
E-mail: baz@xencor.com 



homotrimeric TNF variants that (i) have decreased receptor binding, (ii) sequester 
native TNF homotrimers from TNF receptors by formation of inactive native:variant 
heterotrimers, (iii) abolish TNF signaling in relevant biological assays, and 
(iv) are easily expressed and purified in large quantities from bacteria. Variants 
were tested for TNF receptor activation in cell-based assays, and nonagonistic 
variants were then checked for their ability to antagonize native TNF in cell and 
animal models. Subsequently, we evaluated assembly state, receptor binding, and 
heterotrimer formation for several variants. 

The computational design strategy used crystal structures of native and variant 
TNF trimers as templates for the simulations. Analysis of a homology model of the 
TNF-receptor complex revealed several distinct regions of the cytokine that make 
multiple direct contacts with its receptors (Fig. 1A), including interfaces rich 
in hydrophobic and electrostatic interactions. We ran simulations to select 
nonimmunogenic point mutations that would disrupt receptor interactions while 
preserving the structural integrity of the TNF variants and their ability to 
assemble into heterotrimers with native TNF (14). Many of the designed variants 
displayed markedly reduced binding to TNFR1 and TNFR2, and several combinations 
of potent single mutations further decreased binding (Fig. 1B and fig. S2). As 
predicted by analysis of the TNF-TNFR structural complex, combinations of the 
most potent single mutations at different interaction domains (e.g., A145R and 



www.sciencemag.org SCIENCE VOL 301 26 SEPTEMBER 2003 1895 

REPORTS 

I97T) were frequently additive or synergistic. Moreover, our data extend results 
of previous studies (15, 16) in demonstrating that certain substitutions can alter 
the specificity of receptor interactions; for example, I97T and A145R/I97T show 
greater relative binding to TNFR2 than to TNFR1 (Fig. 1B and fig. S2). 

Homotrimers of several designed TNF variants exhibited >10,000-fold reduction in 
their ability to activate two major signaling pathways downstream of TNF receptor 
activation. Specifically, single variants such as Y87H and A145R and the double 
variant A145R/Y87H were unable to bind to either TNFR1 or TNFR2 

Fig. 1. DN-TNF variants have impaired TNF receptor binding and signaling. 
(A) Structural schematic of human TNF trimer-TNFR1 complex with major contacts 
between ligand and receptor highlighted by solid surfaces (green). Locations of 
representative mutated residues substituted in dominant-negative variants are 
shown in boxes. (B) Increasing concentrations of DN-TNF homotrimers were incubated 
with a fixed concentration of either TNFR1 (black bar) or TNFR2 (white bar), and 
the binding affinity (Kd) was measured. The histogram illustrates the effect of 
mutations on binding affinity between DN-TNF variant homotrimers and TNF 
receptors. (C and D) To measure TNF-induced signaling, we incubated increasing 
concentrations of native TNF (•) or the variants F144N (O), I97T (q), Y87H (■), 
A145R (^), A145R/Y87H (a), and A145R/I97T (O) with either U937 cells to measure 
TNF-induced caspase activity (C) or HEK 293T cells transfected with an 
NF-κB-luciferase reporter plasmid to measure TNF-induced transcriptional 



in cell-free assays (Fig. 1B and fig. S2) and failed to activate either caspase 
(Fig. 1C) or nuclear factor κB (NF-κB)-mediated luciferase expression (Fig. 1D) 
relative to native TNF. In contrast, those variants that still bound TNFRs also 
activated TNF signaling, in some cases (e.g., F144N) more potently than native 
TNF. Disruption of two receptor interfaces (e.g., A145R/Y87H or A145R/I97T) 
effectively destroyed the residual agonism detected with some single-point TNF 
variants. Furthermore, the importance of using multiple screening criteria to 
evaluate DN-TNF bioactivity was revealed by variants such as A145R/I97T, which 
had virtually no TNF-like activity in either cell-based assay yet displayed 
appreciable TNFR2 binding affinity. 

We subsequently tested nonagonistic TNF variants for their ability to act as 
dominant-negative inhibitors by measuring their capacity to block native TNF 
activity in cell-based assays. We evaluated dose-dependent TNF antagonism (14) by 
mixing increasing concentrations of variants with native TNF (5 ng/ml) for 1.5 
hours and measuring the caspase activity induced by these mixtures after addition 
to U937 cells (Fig. 2A). At concentrations as low as twofold that of native TNF 
(10 ng/ml), A145R/Y87H and 

activation (D). DN-TNF variants, especially the double mutants, have reduced TNF 
receptor binding and signaling activity. RLU, relative luciferase units; Caspase 
activity, arbitrary units normalized to Vmax. 




[Fig. 2 plots; x-axis: log [DN-TNF or inhibitors (ng/ml)]] 



Fig. 2. DN-TNF variants inhibit TNF-mediated intracellular signaling. 
(A) Inhibition of caspase activation. Native TNF (5 ng/ml) was mixed with buffer 
(•) or with increasing concentrations of either TNF variant A145R (a), A145R/Y87H 
(*), or A145R/I97T (O), the soluble Fc-TNFR2 fusion etanercept (v), or the TNF 
monoclonal antibody infliximab (▼). After 1.5 hours of incubation in exchange 
buffer (14), these mixtures were applied to U937 cells to stimulate caspase 
activity. (B) Inhibition of NF-κB pathway activation. Native TNF (#) (25 ng/ml) 
was mixed with 20-fold excess (by mass) of A145R (a), A145R/Y87H (a), A145R/I97T 
(O), etanercept ('X), or infliximab (▼). These mixtures were serially diluted and 
applied to HEK 293T cells for 12 hours to induce NF-κB-luciferase reporter 
activity. (C) A145R/Y87H inhibits native TNF-induced nuclear translocation of the 
p65-RelA subunit of NF-κB. Immunofluorescence studies show subcellular 
localization of NF-κB in HeLa cells treated with buffer (panel 1), native TNF 
(10 ng/ml) (panel 2), A145R/Y87H (100 ng/ml) (panel 3), or the combination of 
native TNF and variant A145R/Y87H (panel 4). RLU, relative luciferase units; 
Caspase activity, arbitrary units normalized to Vmax. Scale bar, 25 µm. 

A145R/I97T attenuated TNF-induced caspase 
activity by 50%, and at 20-fold excess, activity was reduced to baseline. The in 
vitro potency (by mass) of these variants is comparable to that of a soluble 
Fc-TNFR2 fusion (etanercept) and more potent than that of an antibody to TNF 
(infliximab), two marketed anti-TNF therapies, supporting the potential utility 
of this mechanism. Similarly, at 20-fold excess over native TNF, single-point 
(A145R, Y87H) and particularly double-point (A145R/Y87H, A145R/I97T) variants 
decreased caspase activation (fig. S3) as well as TNF-induced transcriptional 
activation by NF-κB in human embryonic kidney (HEK) 293T cells (Fig. 2B). 
Consistent with these results, the TNF variant A145R/Y87H (at 10-fold excess over 
native TNF) blocked TNF-induced nuclear translocation of the NF-κB p65-RelA 
subunit in HeLa cells (Fig. 2C). Thus, a number of variants neutralized 
TNF-induced caspase and NF-κB-mediated transcriptional activity over a wide range 
of native TNF concentrations, including the clinically relevant range of 100 to 
200 pg/ml found in the synovial fluid of RA patients (17-19). 

To demonstrate that the mechanism of TNF inhibition requires the formation of 
heterotrimeric complexes with native TNF, we measured 

Fig. 3. DN-TNF variants inhibit signaling by sequestering native TNF into inactive 
heterotrimers. (A) Inverse correlation between heterotrimer formation and caspase 
activity. Native TNF was mixed in exchange buffer (14) with A145R/Y87H, 
A145R/I97T, or etanercept (O), as described for Fig. 2. A part of each mixture was 
analyzed by a sandwich ELISA to detect native:variant TNF (open symbols), and the 
remainder was used to stimulate caspase activity in U937 cells (closed symbols). 
Caspase activity, arbitrary units normalized to Vmax. (B) Native gel analysis of 
heterotrimer formation with various ratios of native (N) and DN-TNF (V). 
FLAG-tagged native TNF was incubated alone (N3:V0, lane 10:0) or with increasing 
concentrations of His-tagged variant A145R/Y87H (lanes 10:1 to 10:100) before 
native gel electrophoresis to determine heterotrimer formation. The differences in 
isoelectric point conferred by the epitope tags allowed for resolution of all 
possible trimer species (N3:V0, N2:V1, N1:V2, and N0:V3). Increasing 
concentrations of DN-TNF variant caused the redistribution of native TNF into both 
heterotrimers, and at 10-fold excess all detectable native TNF was consumed. 




Fig. 4. DN-TNF variants exhibit efficacy in vivo. (A) Effect of heterotrimers of 
various ratios in the galactosamine-sensitized mouse model of human TNF-induced 
endotoxemia. Native human TNF was dosed at 30 µg/kg and A145R/Y87H was dosed at 
the indicated ratios to a fixed native human TNF dose of 30 µg/kg except at the 
1:50* ratio, where A145R/Y87H was dosed in 50-fold excess of native human TNF 
(75 µg/kg). Livers were harvested and samples were blinded and scored for 
apoptotic damage on a scale of 0 to 4 as described (14). *P < 0.05. (B) Efficacy 
of A145R/Y87H in the rat 7-day established CIA model. A145R/Y87H was modified to 
introduce a PEG moiety at [illegible] non-epitope-tagged molecule as described 
(14). One group of four animals was nonarthritic [illegible]; the remaining 
animals were collagen treated and, after the onset of symptoms, they were 
randomized into groups of eight. Animals were treated with vehicle (O), variant at 
10 mg/kg twice daily [illegible], or variant at 2 mg/kg subcutaneously with an 
intravenous loading dose of 2 mg/kg on the first treatment day (■). Measurements 
of ankle diameter were made daily by caliper. *P < 0.05. 




the relation between heterotrimer levels and inhibition of TNF-induced signaling 
(14). We generated heterotrimeric complexes by mixing a fixed amount of 
FLAG-tagged native TNF with increasing concentrations of His-tagged TNF variants. 
A part of this material was used in a sandwich enzyme-linked immunosorbent assay 
(ELISA) (Fig. 3A, open symbols) to detect the formation of His-FLAG heterotrimers, 
and the remainder was applied to U937 cells to detect TNF-mediated caspase 
activation (Fig. 3A, closed symbols). The extent of heterotrimer formation of 
A145R/Y87H or A145R/I97T with native TNF correlated with a decrease in caspase 
activation, demonstrating an inverse relation between signaling and heterotrimer 
formation. As expected, etanercept activity is independent of TNF monomer exchange 
(Fig. 3A, open circles) because etanercept binds to the TNF trimer. To directly 
visualize heterotrimer formation, we mixed FLAG-tagged native TNF with His-tagged 
DN-TNF and resolved the exchanged products using native polyacrylamide gel 
electrophoresis (PAGE) (Fig. 3B) (20). Electrophoresis of equimolar quantities of 
mixed DN-TNF and native TNF resolved the variant homotrimer, 1:2 and 2:1 
native:variant heterotrimers, and native homotrimer in approximately the expected 
1:3:3:1 ratio (Fig. 3B, lane 10:10). Western blot analyses (14) with antibodies 
against the epitope tags confirmed the composition of the intermediate species 
(fig. S4). Stochastic equilibrium modeling of native and variant TNF heterotrimer 
assembly predicts that 10-fold excess of variant homotrimer causes the loss of 
more than 99% of homotrimeric native TNF, primarily into 1:2 native:variant 
heterotrimers, and our results confirmed this (Fig. 3B, lane 10:100). Exchange 
reactions between native and variant TNF reached ~80% completion at 20 min, and 
essentially all the native homotrimer was depleted after 90 min (fig. S5). 
Finally, we confirmed that biological activity of variants requires exchange into 
heterotrimeric complexes with native TNF. Specifically, our most potent variants 
(e.g., A145R/Y87H) failed to block caspase activity induced by chemically 
cross-linked native TNF homotrimers (14), which are unable to dissociate to allow 
exchange with variant TNFs (fig. S6). 
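
The stochastic exchange argument can be illustrated with a simple binomial calculation. The sketch below is a toy model under stated assumptions, not the authors' equilibrium model: it assumes native and variant monomers mix freely and assemble into trimers at random, so species frequencies depend only on the mixing ratio. The function names are hypothetical.

```python
from math import comb


def trimer_distribution(variant_excess):
    """Fraction of trimers of each species for a given variant:native ratio.

    Assumes random assembly from a well-mixed monomer pool (toy assumption).
    """
    p_native = 1.0 / (1.0 + variant_excess)
    p_variant = 1.0 - p_native
    return {
        f"N{k}:V{3 - k}": comb(3, k) * p_native**k * p_variant ** (3 - k)
        for k in range(3, -1, -1)
    }


def native_monomer_fate(variant_excess):
    """Fraction of native monomers that end up in each trimer species."""
    p_native = 1.0 / (1.0 + variant_excess)
    species = trimer_distribution(variant_excess)
    fate = {}
    for k in range(3, 0, -1):
        name = f"N{k}:V{3 - k}"
        # A trimer with k native subunits accounts for k native monomers;
        # on average each trimer contains 3 * p_native native monomers.
        fate[name] = k * species[name] / (3 * p_native)
    return fate
```

Under these assumptions an equimolar mix yields the 1:3:3:1 species ratio, and a 10-fold variant excess leaves under 1% of native monomers in native homotrimers, with most of the remainder in 1:2 native:variant heterotrimers.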

The most potent in vitro inhibitors were selected for testing in vivo, to further 
study the mechanism and to begin therapeutic lead candidate identification. We 
tested the bioactivity of variant homotrimer and native:variant heterotrimers in 
the D-galactosamine (GalN)-sensitized mouse model, which demonstrated that DN-TNF 
homotrimers, and heterotrimers with native TNF, are devoid of agonist activity and 
efficiently exchange with endogenous TNF in vivo. GalN is a known specific 
hepatotoxin that can increase the sensitivity of mice to human TNF by 1000-fold 
(21, 22). Native human TNF (30 µg/kg) induced severe hepatocellular apoptosis and 
lethality, consistent with previous reports (23); in contrast, A145R/Y87H dosed as 
high as 30 mg per kilogram of mouse body weight (along with native TNF at 
30 µg/kg) resulted in no mortality or hepatotoxicity (Fig. 4A) (24). Similarly, 
lethal doses of native human TNF (30 µg/kg) mixed before injection with varying 
ratios of A145R/Y87H produced no TNF-induced damage. This protection was observed 
at native:variant ratios as low as 1:1 and with a superlethal dose of TNF 
(Fig. 4A). Further, sandwich ELISA analyses of serum samples indicated a 
substantial portion ([illegible] ng/ml) of administered A145R (3 mg/kg) was in 
heterotrimers with the endogenous mouse TNF at 1 hour. 

A145R/Y87H was next assessed in a model of chronic disease as an initial test of 
the DN-TNF antagonism mechanism in a disease-relevant setting. We selected the rat 
7-day established collagen-induced arthritis (CIA) model because it simulates 
chronic autoimmune joint disease and can be treated by TNF blockade (25). When 
dosed after the onset of symptoms, only interventions with rapid onset of action 
would be able to affect disease progression in this model, thus requiring rapid 
exchange in vivo of TNF variants with endogenous TNF. To ensure that there were no 
confounding in vivo effects of using affinity-tagged variants, we produced 
A145R/Y87H that lacked such tags. Further, to decrease in vivo clearance, we added 
one polyethylene glycol (PEG; ~5 kD/molecule) to each monomeric subunit of 
A145R/Y87H. This modification had no effect on the dominant-negative properties of 
the molecule in vitro (fig. S7). A145R/Y87H reduced joint swelling in the CIA 
model when dosed once daily at 2.0 mg/kg subcutaneously with a loading dose of 
2.0 mg/kg and twice daily at 10 mg/kg intravenously (Fig. 4B). These results 
demonstrate the potential of DN-TNFs to inhibit TNF-mediated inflammation and 
verify that exchange occurs rapidly enough to affect progression of acute symptoms 
when dosed therapeutically. 

Given their high-yield bacterial production, theoretical low immunogenicity, and 
unique mechanism of action, DN-TNFs show potential as a new class of 
anti-inflammatory therapy, particularly because existing methodologies (i.e., PEG 
modification) can be used to further enhance their pharmacokinetic properties 
(26, 27). Further, we propose that this dominant-negative approach should be 
tested for its potential to create inhibitors of other multimeric extracellular 
signaling molecules, in particular other members of the TNF superfamily (e.g., 
RANKL, CD40L, and BAFF) that have been implicated in human pathophysiology (28). 

The Dog Genome: Survey 
Sequencing and Comparative 
Analysis 

Ewen F. Kirkness, Vineet Bafna, Aaron L. Halpern, 
Samuel Levy, Karin Remington, Douglas B. Rusch, 
Arthur L. Delcher, Mihai Pop, Wei Wang, Claire M. Fraser, 
J. Craig Venter 

A survey of the dog genome sequence (6.22 million sequence reads; 1.5X 
coverage) demonstrates the power of sample sequencing for comparative 
analysis of mammalian genomes and the generation of species-specific 
resources. More than 650 million base pairs (>25%) of dog sequence align 
uniquely to the human genome, including fragments of putative orthologs for 
18,473 of 24,567 annotated human genes. Mutation rates, conserved synteny, 
repeat content, and phylogeny can be compared among human, mouse, and 
dog. A variety of polymorphic elements are identified that will be valuable for 
mapping the genetic basis of diseases and traits in the dog. 



Our understanding of how the human genome functions in health and disease will 
benefit from comparison of its structure with the genomes of other species (1, 2). 
The domestic dog is a particularly good example, where an unusual population 
structure offers unique opportunities for understanding the genetic basis of 
morphology, behaviors, and disease susceptibility (3, 4). The physical and 
behavioral characteristics of ~300 dog "breeds" are maintained by restricting gene 
flow between breeds. Many modern breeds are derived from few founders and have 
been inbred for desired characteristics. This has led to a species with enormous 
phenotypic diversity, but with significant homogenization of 

1The Institute for Genomic Research, Rockville, MD 20850, USA. 2The Center for the 
Advancement of Genomics, Rockville, MD 20850, USA. 

the gene pool within breeds. Many of the ~360 known genetic disorders in dogs 
resemble human conditions, and their causes may be more tractable in large dog 
pedigrees than in small, outbred human families (4, 5). The combination of genetic 
homogeneity and phenotypic diversity also provides an opportunity to understand 
the genetic basis of many complex developmental processes in mammals (6). 

Because of the costs of sequencing mammalian genomes to completion, these projects 
have been restricted to a few species that are considered to be of greatest value 
to biomedical research. The decision as to whether future projects should aim for 
complete sequence coverage of a few more genomes, or whether the existing 
"reference genomes" can be exploited to characterize a wider variety of genomes 
that are sequenced to a lower level of coverage, must be made. Here, 

Combining computational and experimental screening 
for rapid optimization of protein properties 

Xencor, 111 West Lemon Avenue, Monrovia, CA 91016 

Communicated by Pamela J. Bjorkman, California Institute of Technology, Pasadena, CA, October 16, 2002 (received for review August 14, 2002) 



We present a combined computational and experimental method 
for the rapid optimization of proteins. Using β-lactamase as a test 
case, we redesigned the active site region using our Protein Design 
Automation technology as a computational screen to search the 
entire sequence space. By eliminating sequences incompatible with 
the protein fold, Protein Design Automation rapidly reduced the 
number of sequences to a size amenable to experimental screening, 
resulting in a library of ~200,000 mutants. These were then 
constructed and experimentally screened to select for variants with 
improved resistance to the antibiotic cefotaxime. In a single round, 
we obtained variants exhibiting a 1,280-fold increase in resistance. 
To our knowledge, all of the mutations were novel, i.e., they have 
not been identified as beneficial by random mutagenesis or DNA 
shuffling or seen in any of the naturally occurring TEM β-lactamases, 
the most prevalent type of Gram-negative β-lactamases. This 
combined approach allows for the rapid improvement of any 
property that can be screened experimentally and provides a 
powerful broadly applicable tool for protein engineering. 

computational protein design | protein engineering | mutagenesis | 
directed evolution | β-lactamase 

The increased use of enzymes and other proteins in the 
chemical, agricultural, and pharmaceutical industries has 
generated considerable interest in the design of proteins with 
new and improved properties. Two different but complementary 
technologies have been applied to this goal: (i) rational design, 
which relies on structural and mechanistic knowledge and human 
expertise; and (ii) directed evolution methods such as 
error-prone PCR, phage display, and DNA shuffling, which use 
random mutagenesis or recombination to create diversity and 
then experimentally screen the libraries generated for desired 
properties (1). Directed evolution has been successfully used on 
a wide range of proteins (2-7). However, this approach is limited 
by the number of sequences that can be screened experimentally 
(about 10^^ for library panning and 10^ for high-throughput 
screening). Rational design has also been applied with some 
success (8-10), but it was not until computational methods were 
developed that it could be used comprehensively. 

Computational techniques use protein design algorithms to 
perform in silico screening of protein sequences (11-17). By 
taking advantage of the speed of computers, these methods allow 
a vast number of sequences to be screened (~10^80). The ability 
to search such large sequence spaces drastically increases the 
possibility of finding novel proteins with improved properties. 
Computational techniques have also been developed to enhance 
the efficiency of directed evolution methods (18, 19). 

One computational design tool that has proven effective is 
Protein Design Automation (PDA) (13). PDA begins with the 
three-dimensional structural model of the protein to be designed 
and predicts the optimal sequence that will adopt this fold, 
allowing all or a specified set of residues to change. The fitness 
of sequences is scored by using physical potential functions that 
model the energetic interactions of protein atoms (20); stable 
low-energy sequences are given the best scores. By using extremely 
efficient search algorithms, up to 10^80 sequences can be 



accurately screened within hours (21-23). Multiple simultaneous 
mutations can be made, and novel sequences that are very 
different from wild type can be discovered. PDA has shown 
tremendous success in designing proteins with improved stability 
and conformational specificity (13, 14, 24-28) and has even been 
used to engineer a catalytic site into a previously nonreactive 
protein (29). 

In these studies, only a few optimal sequences calculated by 
PDA were made and tested experimentally. The utility of PDA 
can be extended significantly, however, if it is used to generate 
a library of sequences, all of which are predicted to be stable and 
fold into a predetermined structure. Unlike random libraries, 
where most of the mutations are deleterious, the mutant se- 
quences in the PDA library are computationally screened to 
eliminate destabilizing mutations and sequences inconsistent 
with the proper fold. The selected sequences are then experi- 
mentally screened for desired properties such as improved 
catalytic activity, substrate specificity, or receptor binding. 
Therefore, PDA is a computational prescreen to decrease the 
sequence space many orders of magnitude, while maintaining 
broad diversity, to a number easily amenable to experimental 
screening. By coupling PDA with experimental screening, we 
combine the advantages of computational design with those of 
directed evolution: namely, access to a vast sequence space and 
the ability to improve any protein property that can be captured 
by a screen. 

In this paper, we demonstrate the feasibility of this approach 
by using it to increase the resistance of bacteria toward the 
antibiotic cefotaxime by optimizing TEM-1 β-lactamase, the 
most prevalent plasmid-encoded β-lactamase in Gram-negative 
bacteria. 

Methods 

Structure Preparation. The crystal structure of TEM-1 β-lactamase 
(Protein Data Bank no. 1BTL) (30) was used as the starting 
point for modeling. All water molecules and the sulfate group 
were removed; the side chains of residues N132, N154, N170, 
H122, and H289 were flipped to form a better hydrogen bond 
network; and the disulfide bond between C77 and C123 was 
formed manually. The program biograf (Molecular Simula- 
tions, San Diego) was used to generate explicit hydrogens, and 
50 steps of conjugate gradient minimization were performed by 
using the Dreiding II force field (31) without the electrostatics 
term. The minimization is done to make the structure compat- 
ible with our force-field parameters and results in very slight 
changes to the coordinates. 

Construction of Mutant Library. To facilitate introduction of the 
mutations into the TEM-1 gene, a pCR-Blunt (Invitrogen) vector 
containing the TEM-1 gene was digested with XbaI and HindIII, 

Fig. 1. Protein optimization strategy. A set of residues or region of the protein structure is selected to be designed. PDA is used to computationally screen the 
entire designed sequence space and determine the GMEC for the fold. Starting from the GMEC, the Monte Carlo search is then used to explore the sequence 
space and generate a list of near-optimal sequences. An amino acid probability table is obtained from the list and cutoffs are applied to define a PDA library 
of mutant sequences for experimental screening. The PDA library is translated into a DNA library, which is cloned and expressed. An experimental screen is used 
to select clones with improved properties; these are then characterized. Results can be fed into additional cycles. 



treated with T4 DNA polymerase, and religated. Site-directed 
mutagenesis was performed by using QuikChange as described 
by the manufacturer (Stratagene) to remove the existing XbaI 
and HindIII sites. New XbaI and HindIII sites were then 
introduced by site-directed mutagenesis at nucleotides 163 and 
841, respectively, of the TEM-1 gene in the vector. A polyhistidine 
(6XHis) sequence was then added to the 3'-end of the 
TEM-1 ORF to facilitate immunodetection of the proteins, 
thereby creating the vector pXR293. The his-tag was found to 
have no effect on β-lactamase activity. Escherichia coli TOP10 
cells transformed with pXR293 were confirmed to grow on 
media containing 100 µg/ml ampicillin and 50 µg/ml kanamycin. 
The β-lactamase protein expressed from this construct is 
termed TEM-1 in this report. 

The mutated β-lactamase genes were constructed essentially 
as described by Prodromou and Pearl (32) and Chalmers and 
Curnow (33). Oligonucleotides corresponding to the gene were 
synthesized as 40-50 mers with ~15-nt overlaps. At each mutational 
position, multiple oligonucleotides were included in the 
reaction, and the genes were synthesized by using recursive PCR. 
They were then digested with XbaI and HindIII and subcloned 
into pXR293. This vector was then transformed into E. coli 
TOP10 cells (Invitrogen) for expression. 
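
The oligonucleotide layout described above can be sketched as follows. This is a minimal illustration, assuming a single fixed oligo length and overlap; the function name `tile_oligos` is hypothetical. A real recursive PCR design would alternate strands, balance overlap melting temperatures, and include several oligos at each mutated position, none of which this sketch attempts.

```python
def tile_oligos(gene, oligo_len=45, overlap=15):
    """Split a gene sequence into consecutive oligos that overlap their neighbors.

    oligo_len and overlap default to values inside the 40-50 mer / ~15-nt
    ranges quoted in the text; the final oligo may be shorter than oligo_len.
    """
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and shorter than the oligo")
    step = oligo_len - overlap
    oligos = []
    start = 0
    while True:
        oligos.append(gene[start:start + oligo_len])
        if start + oligo_len >= len(gene):
            break  # this oligo reaches the end of the gene
        start += step
    return oligos


def reassemble(oligos, overlap=15):
    """Rebuild the full sequence by dropping each oligo's leading overlap."""
    return oligos[0] + "".join(oligo[overlap:] for oligo in oligos[1:])
```

Reassembling the tiles is a quick sanity check that the overlaps are consistent before ordering oligos.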

Selection of Cefotaxime-Resistant Mutants. E. coli cells expressing 
the mutant library of TEM-1 genes were grown on plates 
containing increasing concentrations of cefotaxime, and the 
minimum inhibitory concentration (MIC) for survival was determined. 
The cefotaxime concentrations used were: 0.01, 0.025, 
0.05, 0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, and 128 µg/ml. Assays 
were conducted at 25 and 30°C to ensure soluble expression of 
the designed proteins. Cells were plated at low density (< 1,000 
per plate) to ensure that the observed resistance was not due to 
confluence. Clones demonstrating the highest resistance were 
picked, and the β-lactamase protein was identified with immunoblot 
analysis (34) by using pooled 5- and 6-his polyclonal 
antibodies (Qiagen, Valencia, CA). The TEM-l gene of the most 
resistant variants was sequenced to identify the mutations. New 
genes containing the mutations identified for PDA-1, -2, and -3 
were constructed, and MICs were determined to confirm the 
initial screening results. 
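
The MIC readout can be expressed as a small helper. This sketch assumes growth is recorded as a yes/no observation per plate concentration; the function names and the example growth pattern are hypothetical and are not results from the paper.

```python
from typing import Dict, Optional

# Plate concentrations listed in the text (µg/ml cefotaxime).
CEFOTAXIME_UG_PER_ML = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128]


def minimum_inhibitory_concentration(growth: Dict[float, bool]) -> Optional[float]:
    """Return the lowest tested concentration at which no growth was observed."""
    for concentration in sorted(growth):
        if not growth[concentration]:
            return concentration
    return None  # colonies grew at every tested concentration


def fold_change(mic_variant: float, mic_reference: float) -> float:
    """Resistance of a variant relative to a reference, as a ratio of MICs."""
    return mic_variant / mic_reference


# Hypothetical clone that grows on plates up to and including 0.05 µg/ml.
example_growth = {c: c <= 0.05 for c in CEFOTAXIME_UG_PER_ML}
```

Because the plates form a discrete dilution series, any MIC obtained this way is resolved only to the nearest tested concentration, and fold changes inherit that granularity.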

Results 

Protein Optimization Strategy: Combining PDA with Experimental 
Screening. The overall strategy for protein optimization is shown 
in Fig. 1. PDA is used to computationally design a protein and 
define a library of mutant sequences at specific positions. PDA's 
optimization algorithms are then run to screen all possible 
sequences for the global optimal sequence and conformation for 
the target fold, the one with the lowest energy as determined by 
the scoring function. This conformation is termed the global 
minimum energy conformation (GMEC). Starting from this 
optimal structure, a search algorithm such as Monte Carlo (35, 
36) simulated annealing is used to explore sequence space, and 
generate a list of other near-optimal sequences. The Monte 
Carlo list is rank-ordered by energy score and may contain as 
many sequences as desired (e.g., the best 1,000). An amino acid 
probability table is then generated from the list by counting 

Fig. 2. Reduction of sequence space for PDA design of TEM-1 β-lactamase. 
Computational screening with PDA and judicious application of cutoffs reduced 
the sequence space 18 orders of magnitude for the 19 residues explicitly 
considered and more than 300 orders of magnitude for the entire protein. This 
conformational screen specified a library for experimental screening of 
~200,000 mutant sequences, enriched for structural integrity. 

amino acid occurrences at each of the designed positions. 
Different cutoffs or weighting functions can be applied to define 
a library of a desired size, appropriate for experimental screen- 
ing. Structure or sequence alignment information, experimental 
data, and diversity considerations may also be taken into account 
in defining the mutant library. 
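
The step from a ranked sequence list to a library definition can be sketched as below. This is a minimal illustration, assuming equal-length sequences and a single frequency cutoff applied uniformly at every position; the function names and the toy three-residue sequences in the usage note are hypothetical, and PDA's actual cutoffs and weighting functions are not specified here.

```python
from collections import Counter
from math import prod


def probability_table(sequences):
    """Per-position amino acid frequencies across a list of equal-length sequences."""
    total = len(sequences)
    length = len(sequences[0])
    table = []
    for position in range(length):
        counts = Counter(seq[position] for seq in sequences)
        table.append({aa: count / total for aa, count in counts.items()})
    return table


def apply_cutoff(table, cutoff):
    """Keep, at each position, the amino acids whose frequency meets the cutoff."""
    return [
        sorted(aa for aa, freq in column.items() if freq >= cutoff)
        for column in table
    ]


def library_size(allowed):
    """Number of distinct sequences in the combinatorial library."""
    return prod(len(options) for options in allowed)
```

For example, with the toy list `["MTF", "MTF", "MSF", "LTF", "MTY"]`, a cutoff of 0.2 keeps two residues at each of the three positions, while a cutoff of 0.5 keeps only the most common residue at each. Raising or lowering the cutoff is one way to tune the library toward a size that can be screened experimentally.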

Recursive PCR with overlapping oligonucleotides is then used to 
synthesize the genes containing all of the mutant sequences in the 
PDA-defined library. The genes are pooled and cloned, and the 
mutant proteins are expressed in an appropriate host such as E. coli. 
The mutant proteins are screened experimentally for desired 
properties, and the best mutants are isolated and characterized. 
These results can be used as feedback for additional rounds of 
computational design, library generation, and screening. 

Reduction of Sequence Space. The use of PDA as a computational 
screen allows us to access a vast sequence space and, by 
eliminating sequences predicted to be destabilizing or inconsis- 
tent with the proper fold, reduce it to a size amenable to 
experimental screening. The reduction of sequence space obtained 
for TEM-1 β-lactamase, our test case, is shown in Fig. 2. 
If we were to consider the entire β-lactamase protein (263 
residues) and allow all 20 amino acids at each position, we would 
need to screen 20^263 or ~1.4 × 10^342 sequences. By focusing the 
design to a particular region (19 residues near the active site) and 
using a slightly restricted set of amino acids (19), we reduced this 
to 7 × 10^23 sequences, a number that can easily be screened 
computationally, but not experimentally. We then chose cutoffs 
for the Monte Carlo list and the probability table that would 
define a library within the limits of experimental screening. In 
this case, we specified a library of ~200,000 mutant sequences, 
a reduction of 18 orders of magnitude for the residues explicitly 
considered and an overall reduction of more than 300 orders of 
magnitude for the entire protein. 
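The arithmetic behind these figures can be checked with a short sketch. It assumes 18 amino acid types at each of the 19 designed positions (20 minus cysteine and proline, as stated in the design section) and works in log10 units so that very large counts stay readable; the function name is illustrative only.

```python
import math

def log10_space(n_types: int, n_positions: int) -> float:
    """log10 of the number of sequences with n_types choices at each position."""
    return n_positions * math.log10(n_types)

full_protein = log10_space(20, 263)   # all 20 amino acids at 263 residues
designed = log10_space(18, 19)        # assumed: 18 types at 19 designed positions
library = math.log10(200_000)         # approximate experimental library size

print(f"full protein:    10^{full_protein:.1f} sequences")
print(f"designed region: 10^{designed:.1f} sequences")
print(f"reduction (designed region): {designed - library:.1f} orders of magnitude")
print(f"reduction (full protein):    {full_protein - library:.1f} orders of magnitude")
```

Under these assumptions the designed region shrinks by roughly 18 orders of magnitude and the full protein by more than 300, in line with the text.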

Computational Design of β-Lactamase. The hydrolysis of β-lactam
antibiotics, catalyzed by β-lactamase, is a common mechanism by
which bacteria become resistant to antibiotics (37). The most
prevalent plasmid-encoded β-lactamase in Gram-negative bacteria
is the class A TEM-1 β-lactamase (38). This enzyme hydrolyzes
ampicillin efficiently but is inefficient at hydrolyzing the cephalosporin
cefotaxime. Our goal was to use PDA to design β-lactamase
variants that confer increased resistance toward cefotaxime.



Optimizing the area around the active site is likely to have a
significant effect on enzyme activity and substrate specificity (37,
39). Although more distant mutations can also be effective (40),
the rationale for how to select such positions is less obvious.
Designing residues around the active site also serves as a
stringent test of the ability of PDA to predict nondisruptive
mutations. We therefore focused our design on residues within
5 Å of the active site residues S70, K73, S130, E166, and K234.
These criteria resulted in 19 positions that were allowed to
change: M69, T71, F72, V74, V103, Y105, A126, I127, N132,
A135, N136, L169, N170, M211, D214, K234, S235, G236, and
I247. All 20 amino acids, except cysteine and proline, were
considered at these positions. The catalytic residues (S70, K73,
S130, and E166) were not allowed to change their amino acid 
identities; however, their conformations could vary. An ex- 
panded version of the backbone-dependent rotamer library of 
Dunbrack and Karplus (41) was used in all of the calculations, 
and the DEE algorithm was used to find the GMEC. The 
computational details, residue classification, and potential func- 
tions used are described in previous work (13, 14, 20, 42). 

Definition of Mutant Library. Optimization with PDA predicted an 
optimal sequence with nine mutations. Starting from this
GMEC, we applied Monte Carlo simulated annealing to produce
a rank-ordered list of the 1,000 lowest energy sequences. A
probability table was generated from this list by counting the 
amino acid occurrences at each of the 19 designed positions 
(Table 1). A 10% cutoff was then applied to the probability table 
to define a library of mutant sequences for experimental screen- 
ing; that is, for a given position, an amino acid identity was 
included in the library if it had a 10% or greater probability of 
occurrence. To ensure that the library spanned the complete 
sequence space from the wild-type enzyme to the most distantly 
related PDA mutant, we always included the wild-type identity 
at all designed positions, even if it did not appear in the Monte
Carlo list. With a 10% cutoff, this gave us a library of 172,800 
unique sequences; a 20% cutoff would have resulted in a much 
smaller library of 4,806. 
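The library definition described above (count amino acid occurrences per position in the ranked list, apply a probability cutoff, always keep the wild-type identity, then multiply the per-position choices) can be sketched as follows. The three-position sequences are toy data invented for illustration, not the TEM-1 probability table, and the function names are hypothetical.

```python
from collections import Counter
from math import prod

def define_library(ranked_seqs, wild_type, cutoff=0.10):
    """Per position, keep amino acids whose frequency in ranked_seqs is at
    least `cutoff`, and always add the wild-type identity."""
    n = len(ranked_seqs)
    allowed = []
    for pos, wt in enumerate(wild_type):
        counts = Counter(seq[pos] for seq in ranked_seqs)
        keep = {aa for aa, c in counts.items() if c / n >= cutoff}
        keep.add(wt)  # wild type is included even if absent from the list
        allowed.append(sorted(keep))
    return allowed

def library_size(allowed):
    """Number of unique sequences: product of choices at each position."""
    return prod(len(choices) for choices in allowed)

# Toy stand-in for a Monte Carlo list (10 sequences, 3 designed positions).
wild_type = "MVY"
ranked = ["DQN"] * 6 + ["DQY"] * 2 + ["EQN"] + ["DLN"]

for cutoff in (0.10, 0.20):
    allowed = define_library(ranked, wild_type, cutoff)
    print(cutoff, allowed, library_size(allowed))
```

Raising the cutoff drops the rarer identities at each position, which is why the stricter cutoff yields the smaller library.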

Construction of Genes for Mutant Library. Recursive PCR with
overlapping oligonucleotides was used to synthesize the TEM-1
β-lactamase genes containing all 172,800 mutant sequences in
the PDA library. Synthetic oligonucleotides containing the
designed mutations were pooled to create desired diversity at
each site. Two separate reactions were performed: one that
contained only a proofreading DNA polymerase (Pfu DNA
polymerase), termed the nonerror-prone reaction, and one that
contained both Pfu DNA polymerase and Taq DNA polymerase,
termed the error-prone reaction. The mutated genes were
cloned and transformed into E. coli.

Validation of Mutant Library. Sixty individual clones from the
nonerror-prone library were sequenced by standard techniques.
The plasmids contained intact ORFs with the desired mutations.
No additional mutations were detected. With a sample size of 60,
we were able to find all of the specified mutations at each
designed position. It is impossible to find all combinations of the 
mutations within this small sample (the library contained 
172,800 unique sequences), but none of the clones were identical 
and we were unable to detect a statistically significant bias 
toward any particular mutation at any position. This result 
indicates that we have developed an efficient method for con- 
verting a PDA-defined library into an experimental library 
containing all of the mutated genes required to encode the 
desired mutant sequences. 

Table 1. PDA probability table for designed positions of TEM-1 β-lactamase

Position WT Amino acid probabilities predicted by PDA, %

Amino acids included when applying different probability cutoffs are indicated as follows: 20% cutoff in dark gray, 10% cutoff
in light gray, and 1% cutoff in white background. In defining the PDA library, the 10% cutoff was used and the wild-type amino
acid identity was added if it did not appear in the Monte Carlo list. This specified a library of 172,800 unique sequences.

Experimental Screen for β-Lactamase Activity. Experimental libraries
of ≈500,000 individual E. coli colonies expressing the mutated
β-lactamase genes were pooled and plated onto increasing
concentrations of cefotaxime in a single round of selection, and
the MIC for survival was determined. This number of colonies
is about three times the size of the PDA-defined library and was
used to ensure that the pooled DNA library contained at least
one copy of each mutant sequence (within a 95% level of
confidence) (43). Clones from the nonerror-prone library had a
MIC of 64 μg/ml, which is a 640-fold increase in resistance
compared with the wild-type value of 0.1 μg/ml. Clones from the
error-prone library had a MIC of 128 μg/ml, a 1,280-fold
increase in resistance (see PDA-1 and -2, Table 2). Because our
approach allows us to assay the complete PDA library diversity

Table 2. Antibiotic resistance of TEM-1 β-lactamase variants

                                                                   No. of     No. of novel  Cefotaxime                  Ampicillin
Variant     Mutations                                              mutations  mutations     MIC, μg/ml  Fold increase*  MIC, μg/ml  Fold decrease*
TEM-1 (WT)                                                                                  0.1                         4,096†
PDA-1       M69D, V103Q, Y105N, N132M, L169A, N170L, S235D, G236S  8          8             64          640             100‡        40
PDA-2       V103Q, Y105N, I127L, L169A, S235Y, G236S               6          6             128         1,280           100‡        40
PDA-3       M69D, V103Q, N132M, L169A, N170L, S235D                6          6             64‡         640             ND          ND
TEM-15      E104K, G238S                                           2                        16          160             4,096§      1
ST-1        E104K, G238S, M182T, A18V                              4                        256         2,560           ND          ND

Resistance measured at 25°C unless specified otherwise. ND, MIC not determined. Boldface indicates novel mutations
(not reported to significantly improve cefotaxime resistance in any TEM β-lactamase; refs. 2, 45, and 46;
Jacoby, G. and Bush, K., www.lahey.org/studies/temtable.htm).
*Fold increase/decrease in resistance is relative to wild type (TEM-1).
†Value reported by Cantu and Palzkill (44).
‡MIC assay done at 30°C.
§Value reported by Shannon et al. (49).
in a single round, we were able to use very stringent selection
conditions and directly obtain highly resistant variants. The 
identification of incrementally improved sequences was not 
necessary. 
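The choice of roughly three times as many colonies as library members can be illustrated with a simple sampling model. The sketch below assumes every sequence is equally represented and that clones are drawn independently, and it computes the probability that any one given sequence appears at least once; this is one common simplified reading of the 95% confidence statement, not necessarily the calculation used in ref. 43.

```python
import math

def prob_sequence_present(library_size: int, n_clones: int) -> float:
    """Probability that one specific sequence appears at least once among
    n_clones independent, uniform draws from the library."""
    return 1.0 - (1.0 - 1.0 / library_size) ** n_clones

def clones_for_confidence(library_size: int, confidence: float) -> int:
    """Smallest number of clones giving at least `confidence` probability
    that a specific sequence is present."""
    per_draw_miss = math.log(1.0 - 1.0 / library_size)
    return math.ceil(math.log(1.0 - confidence) / per_draw_miss)

LIBRARY = 172_800
print(f"P(present) with 500,000 clones: {prob_sequence_present(LIBRARY, 500_000):.3f}")
needed = clones_for_confidence(LIBRARY, 0.95)
print(f"clones for 95%: {needed} ({needed / LIBRARY:.2f} x library size)")
```

Under this model the required oversampling factor is -ln(0.05), which is close to 3.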

Substrate Specificity. We also measured resistance to ampicillin
and found no growth at 100 μg/ml, significantly less than the
MIC of 4,096 μg/ml reported for the wild type (43). This result
suggests that our screens identified clones whose resistance to 
cefotaxime had dramatically improved, whereas their resistance 
to ampicillin was reduced at least 40-fold. The relative substrate 
specificity toward cefotaxime vs. ampicillin was thus enhanced 
25,000- to 50,000-fold. 
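The specificity figure follows from multiplying the gain in cefotaxime resistance by the loss in ampicillin resistance. A minimal sketch using the MIC values quoted in the text is given below; it treats 100 μg/ml as an upper bound on the ampicillin MIC of the variants, so the result is a lower bound on the shift, and the function name is illustrative.

```python
WT_CEFOTAXIME_MIC = 0.1       # ug/ml, wild type
WT_AMPICILLIN_MIC = 4096.0    # ug/ml, wild type (literature value cited in the text)
VARIANT_AMPICILLIN_MAX = 100.0  # ug/ml, no growth observed at this concentration

def specificity_shift(variant_cefotaxime_mic: float) -> float:
    """Lower bound on the change in cefotaxime vs. ampicillin specificity."""
    cefotaxime_gain = variant_cefotaxime_mic / WT_CEFOTAXIME_MIC
    ampicillin_loss = WT_AMPICILLIN_MIC / VARIANT_AMPICILLIN_MAX
    return cefotaxime_gain * ampicillin_loss

for label, mic in (("nonerror-prone library", 64.0), ("error-prone library", 128.0)):
    print(f"{label}: at least {specificity_shift(mic):,.0f}-fold")
```

With the unrounded 41-fold ampicillin ratio the products come out slightly above the rounded 25,000- to 50,000-fold range quoted in the text.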

PDA Mutants Are Novel. The most active mutant from each library
was isolated and sequenced. PDA-1 had eight mutations (M69D,
V103Q, Y105N, N132M, L169A, N170L, S235D, and G236S), all
designated in the PDA library (see Table 1). PDA-2 had five
PDA-designed mutations (V103Q, Y105N, I127L, L169A, and
G236S) and one random mutation (S235Y). The S235Y mutation
was not predicted by PDA due to steric clashes. Protein backbone
motion, which is required to relieve the clash, is not considered in
the computation. None of the mutations in PDA-1 or -2 have been
identified by full gene random mutagenesis or DNA shuffling
studies (2, 39, 45, 46) or have been observed in the 105 naturally
occurring TEM β-lactamases (G. Jacoby and K. Bush,
www.lahey.org/studies/temtable.htm). Orencia et al. (47) discussed the
emergence of antibiotic resistances in β-lactamases and showed that
there is an overlap between the mutations discovered by directed 
evolution and those occurring in natural evolution. PDA, however, 
accesses the entire designed sequence space including all possible 
combinations of mutations and therefore can produce multiple 
simultaneous mutations. PDA is therefore more likely to identify 
novel mutants with desired properties. The lone random mutation 
in PDA-2 (S235Y) was in the active site region, suggesting that the 
novel context of the PDA-designed mutants allowed this previously 
unobserved, but beneficial, mutation to emerge. 

Two of the mutations in PDA-1 (Y105N and G236S) were
reverted to wild type to create a backcross mutant (PDA-3). This 
PDA-3 sequence is present in the library defined with a 20% cutoff 
but is absent if a 10% cutoff is used (see Table 1). PDA-3 exhibited 
the same cefotaxime resistance as PDA-1 (Table 2), indicating that 
a smaller PDA-library (4,806 vs. 172,800 sequences) can also 
generate mutants with significantly improved activity. Additional 
backcrosses were done to examine the role of the other six muta- 
tions in PDA-1. No single mutation was primarily responsible for 
the improved resistance, and no simple additivity was apparent, 
suggesting that the mutations are coupled. This conclusion is 
supported by extensive replacement mutagenesis studies of three- 
residue segments around the active site (39). They found a mutant 
(E168G, L169A, and N170G) that included one of our mutations 
(L169A), but it showed only a marginal (2-fold) improvement in 
cefotaxime resistance. Although they also tested most of our other 
mutations, no increased resistance was found for any of these. This 
lack of improved resistance indicates that the broader context of 
many simultaneous mutations provided by our approach was 
required to find our highly active sequences. 

Comparison with Other Mutants. To compare the activity of our 
PDA-designed mutants with those obtained in other studies, we 
introduced some previously reported mutations into our wild- 
type gene, including E104K/G238S (comparable to TEM-15) (2, 
46) and A18V/E104K/M182T/G238S (comparable to ST-1) (2, 
46). TEM-15 is a naturally occurring β-lactamase that is active
against cefotaxime, and ST-1 is a highly active TEM-1 variant
discovered from three rounds of DNA shuffling. We tested the
ability of these mutants to confer resistance to cefotaxime. Wild
type had a MIC of 0.1 μg/ml, comparable to the values reported




Fig. 3. Location of mutations in PDA-1 and -2 (green) vs. those obtained by
DNA shuffling (2) and random hypermutagenesis (46) (magenta). The wild-type
TEM-1 β-lactamase structure is illustrated, and the side chains of the
mutated positions are shown. The catalytic serine (S70) is depicted in blue. The
average distance between the Cα atoms and the catalytic nucleophile (Oγ S70)
in our PDA-1 and -2 mutations was 8.0 and 8.6 Å, respectively, vs. 16.0 Å for the
mutations in ST-2 and -3 (Stemmer's best mutants) (2) and 12.1 Å for 3D.5
(Zaccolo and Gherardi's best mutant) (46). This difference in distances illustrates
that the mutations found by PDA are near the designed active site area,
whereas those found by DNA shuffling and random hypermutagenesis are
farther away.



by others; TEM-15 and ST-1 had MICs of 16 and 256 μg/ml,
respectively, also in line with previously published work (Table
2) (2, 44, 46, 48-50).

Location of Mutations. The mutations in all our variants are 
located in or near the active site, because our computational 
design restricted changes to this region. Directed evolution 
methods, however, tend to produce mutations spread over the 
entire protein structure. For example, almost all of the mutations 
in the best mutants obtained by DNA shuffling (2) and random 
hypermutagenesis (46) are located far from the active site (Fig.
3). It is possible that these techniques seldom produce mutations
close to the active site, because they rely on incremental changes; 
a single change in the first round of screening must be beneficial 
to be passed to the second round. However, point mutations in 
the active site area are usually disruptive. Our approach, on the 
other hand, allows multiple simultaneous mutations in a single 
round, which can have compensating or even synergistic effects. 

Sequence Space Coverage. Most of the mutations observed in our
PDA variants require a minimum of two nucleotide changes, and
one, M69D, can be made only by a triple nucleotide change (Table
3). Double- or triple-nucleotide changes within a single codon are
very difficult to achieve by using random mutagenesis techniques
such as error-prone PCR or single-gene DNA shuffling. This
limitation is demonstrated by the fact that each of the mutations
found in the directed evolution studies (2, 45, 46) as well as those
observed in the 105 naturally occurring TEM variants (G. Jacoby
and K. Bush, www.lahey.org/studies/temtable.htm) could be obtained
by a single nucleotide change. If one considers all of the
substitutions that are possible for each of the 20 amino acids, on 
average only seven can be achieved by a single nucleotide change. 
The sequence space coverage is further reduced by codon prefer- 
ences, biases for transitions over transversions, and A T over 
G ^ C mutations. These restrictions severely limit the sequence 
Table 3. Minimal number of nucleotide changes required for amino acid mutations in TEM-1 β-lactamase variants

               Position
Variant        18   42   69   92   103  104  105  127  132  169  170  182  235  236  238  240  241  254
TEM-1 (WT)     A    A    M    G    V    E    Y    I    N    L    N    M    S    G    G    E    R
TEM-15                                  K,1                                          S,1
ST-1*          V,1                      K,1                           T,1            S,1
ST-2*†, ST-3*                 S,1       K,1                           T,1            S,1       H,1
ST-4*                                   K,1                           T,1            S,1
3D.5†‡                                  K,1                           T,1            S,1
3A.6‡                                                                                S,1       ?,1  G,1
PDA-1                    D,3       Q,2       N,1       M,2  A,2  L,2       D,2  S,1
PDA-2                              Q,2       N,1  L,1       A,2            Y,1  S,1
PDA-3                    D,3       Q,2                 M,2  A,2  L,2       D,2

Each entry gives the mutant amino acid followed by the minimal number of nucleotide changes.
Mutations requiring two or more nucleotide changes are shown in bold.
*Stemmer (2).
†Has additional silent mutations.
‡Zaccolo and Gherardi (46).



space accessible to these methods and suggest why our approach, 
which does not suffer from these limitations, is more likely to 
produce novel functional sequences. 
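The nucleotide-change counts discussed in this section can be reproduced from the standard genetic code. The sketch below takes the minimum Hamming distance over all codon pairs for the wild-type and mutant amino acids; this is an assumption of the example, since the actual wild-type codon in the TEM-1 gene fixes the starting point and could give a higher count for amino acids with several codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (first base varies slowest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons_for(aa: str) -> list:
    """All codons encoding the one-letter amino acid `aa`."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]

def min_nucleotide_changes(wt_aa: str, mut_aa: str) -> int:
    """Smallest number of base changes converting any codon for wt_aa into
    any codon for mut_aa."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(wt_aa)
        for c2 in codons_for(mut_aa)
    )

for mutation in ("M69D", "V103Q", "Y105N", "L169A", "S235D", "E104K", "G238S"):
    print(mutation, min_nucleotide_changes(mutation[0], mutation[-1]))
```

Methionine has a single codon (ATG), so the three changes found for M69D do not depend on the codon-choice assumption.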

Discussion 

The purpose of this study is to show that a combined approach, 
using PDA as a computational screen to rationally reduce the 
sequence space before experimental screening, can rapidly lead 
to novel protein variants with improved properties. We used 
PDA to identify sequences compatible with the protein fold and 
then experimentally screened the resulting sequence library to 
obtain variants with novel properties. As a test case, we redesigned
the active site of β-lactamase and then selected for
variants with improved resistance to cefotaxime. In a single
round, we obtained variants that exhibit a 1,280-fold increase in
resistance and, to our knowledge, are novel.

Our approach has some key distinctions from purely experimen- 
tal techniques. By using an efficient computational screen, we are 
able to access an extremely large region of sequence space and 
rapidly reduce it to a number amenable to experimental screening. 
Experimental libraries will always be restricted to sampling a
minuscule portion of sequence space due to limitations on the sheer
mass of proteins that can be physically made and tested. Computational
screening, on the other hand, is scalable and can comprehensively
screen enormous sequence spaces. The sequence space
searched by random mutagenesis techniques is also severely limited
by the fact that amino acid mutations requiring more than single
nucleotide changes are extremely unlikely. Random techniques
usually produce only incremental changes per round and require
multiple rounds, whereas our approach creates multiple simultaneous
mutations in a single round. All of these features result in the
rapid discovery of novel mutants that are different from any of
those observed previously.

Another difference with the PDA method is that it offers full 
control over the location and type of mutations, allowing incor- 
poration of structural and experimental data and enhancing our 
understanding of structure-activity relationships. It results in 
focused designs of smaller functionally enriched mutant librar- 
ies, allowing complex and expensive screens such as mammalian 
cell-based assays for improved protein therapeutics, while still 
accessing broad sequence diversity. Our combined computa- 
tional and experimental approach allows for the rapid improve- 
ment of any protein property that is amenable to experimental 
screening and has broad applications in the chemical, agricul- 
tural, and pharmaceutical industries. 
Abstract

Granulocyte-colony stimulating factor (G-CSF) is used worldwide to prevent neutropenia caused by high-dose
chemotherapy. It has limited stability, strict formulation and storage requirements, and because of poor
oral absorption must be administered by injection (typically daily). Thus, there is significant interest in
developing analogs with improved pharmacological properties. We used our ultrahigh throughput computational
screening method to improve the physicochemical characteristics of G-CSF. Improving these
properties can make a molecule more robust, enhance its shelf life, or make it more amenable to alternate
delivery systems and formulations. It can also affect clinically important features such as pharmacokinetics.
Residues in the buried core were selected for optimization to minimize changes to the surface, thereby
maintaining the active site and limiting the designed protein's potential for antigenicity. Using a structure
that was homology modeled from bovine G-CSF, core designs of 25-34 residues were completed, corresponding
to 10^21 to 10^28 sequences screened. The optimal sequence from each design was selected for
biophysical characterization and experimental testing; each had 10-14 mutations. The designed proteins
showed enhanced thermal stabilities of up to 13°C, displayed five- to 10-fold improvements in shelf life, and
were biologically active in cell proliferation assays and in a neutropenic mouse model. Pharmacokinetic
studies in monkeys showed that subcutaneous injection of the designed analogs results in greater systemic
exposure, probably attributable to improved absorption from the subcutaneous compartment. These results
show that our computational method can be used to develop improved pharmaceuticals and illustrate its
utility as a powerful protein design tool.

Keywords: Protein design; computational screen; stability; cytokines; granulocyte-colony stimulating 



Many techniques have been used in the design of new and 
improved proteins. In vitro directed evolution methods such 
as phage display, DNA shuffling, and error-prone PCR are
widely used. Rational design approaches continue to be ap- 
plied, and strategies that combine both are now being used. 
Successful designs include enzymes (Chen and Arnold
1991; Stemmer 1994; Zhao et al. 1998) and other proteins
(Crameri et al. 1996), as well as therapeutically useful proteins
such as hormones and cytokines (Lowman and Wells
1993; Heikoop et al. 1997; Grossmann et al. 1998; Chang et
al. 1999). The experimental techniques involve the genera- 
tion and screening of libraries of random protein sequences. 
However, the number of sequences that can be screened ex- 
perimentally is limited (about lO'* for library panning and 10^ 
for high throughput screening). Libraries of this size allow for 
the simultaneous modification of only about 10 residues. 
Computational stabilization of cytokines

Computational methods have also been used that perform
in silico screening of protein sequences (Hellinga and
Richards 1994; Desjarlais and Handel 1995; Dahiyat and
Mayo 1996, 1997a; Street and Mayo 1999; Jiang et al. 2000;
Kraemer-Pecore et al. 2001; Pokala and Handel 2001). Exploiting
the efficiency and speed of computers, these methods
can randomly screen a vast number of sequences (up to
10^80), allowing for the simultaneous consideration and
modification of more than 60 residues. Searching such large
sequence spaces drastically improves the possibility of finding
novel protein sequences with improved properties.

Investigators have recently developed a computational 
screening method that finds the optimal sequence for a de- 
fined three-dimensional structure, allowing all or part of the 
sequence to change (Dahiyat and Mayo 1996). This method, 
termed Protein Design Automation (PDA), scores the fit of 
sequences to the three-dimensional structure using physical- 
chemical potential functions that model the energetic inter- 
actions of protein atoms, including steric, solvation, and 
electrostatic interactions. PDA couples these potential func- 
tions with a highly efficient search algorithm to accurately 
screen up to 10*° sequences. Because the screening is per- 
formed in silico, multiple simultaneous mutations can be 
made, and novel sequences that are very different from wild 
type can be discovered. The method has been validated by 
numerous experimental tests and has resulted in the design 
of new proteins with improved stability and conformational 
specificity, and novel activity (Dahiyat and Mayo 1996, 
1997a; Malakauskas and Mayo 1998; Strop and Mayo 1999; 
Shimaoka et al. 2000; Bolon and Mayo 2001; Marshall and
Mayo 2001). 

PDA also has the advantage of being able to control the 
location and type of mutations. For example, the design can 
be limited to the hydrophobic core. Mutations in the core 
can produce significant improvements in protein stability 
but do not change binding epitopes on the surface of the 
molecule. Thus, the molecular surface can be kept identical 
to the native structure, retaining biological activity and lim- 
iting toxicity and antigenicity. This feature is particularly 
important in the design of therapeutic proteins. 

We wanted to take advantage of these features of PDA
and explore its utility in the design of improved pharmaceuticals.
We therefore used PDA as an ultrahigh throughput
screen for improved analogs of a therapeutic protein,
granulocyte-colony stimulating factor (G-CSF). G-CSF is a
hematopoietic growth factor of 174 residues that induces
differentiation and proliferation of granulocyte-committed
progenitor cells. It is used clinically to treat cancer patients
and alleviate the neutropenia induced by high-dose chemotherapy.
G-CSF belongs to the class of long-chain four-helix
bundle cytokines that bind asymmetrically to homodimeric
complexes of cell-surface receptors to initiate an intracellular
signaling cascade. Their structural similarity
allows the design strategy chosen for G-CSF to be immediately
applicable to the other four-helix bundle cytokines
(human growth hormone, erythropoietin, the interleukins,
and interferon-α/β, all clinically important compounds)
and thus broadens the potential impact of the results.

Although the cytokines are functionally very efficacious, 
their pharmacological properties are not ideal. For example, 
G-CSF, like most proteins, is not absorbed orally to any 
significant extent and must be administered by frequent 
(daily) injections throughout the course of treatment. It also 
has limited stability and strict formulation and storage re- 
quirements, including the need to be kept refrigerated. Thus, 
there is significant interest in developing analogs with im- 
proved pharmacological properties. 

We sought to use PDA to improve the physicochemical 
characteristics of G-CSF. Improving these properties can 
make a molecule more robust, enhance its shelf life, or 
make it more amenable to use in alternate delivery systems 
and formulations. It can also affect clinically important fea- 
tures such as pharmacokinetics and result in a drug that is 
safer for human use. Our design strategy was to optimize the 
core to improve the stability and solution properties of 
G-CSF while preserving receptor binding and biological
activity. 

The template structure used for in silico screening was a 
homology model of human G-CSF in which the human 
sequence was mapped onto bovine G-CSF. We designed 
several novel core sequences, cloned and expressed them, 
characterized their stabilities, tested them for functional activity
both in vitro and in vivo, and studied their pharmacokinetics
in monkeys. The designed proteins showed enhanced
thermal stabilities, displayed five- to 10-fold improvements
in shelf life, and were biologically active both
in cell proliferation assays and in a neutropenic mouse
model. Subcutaneous injection of the most stable variant in
monkeys also resulted in greater systemic exposure, probably
attributable to improved absorption from the subcutaneous
compartment. These results indicate that PDA has
great potential as a powerful in silico tool in the design of 
improved pharmaceutical proteins. 

Results and Discussion 
Homology modeling 

The crystal structure of bovine G-CSF (PDB record 1bgc)
(Lovejoy et al. 1993) was used as the starting point for
modeling because the crystal structure of human G-CSF
(PDB record 1rhg) (Hill et al. 1993) is at a lower resolution
and is missing key fragments, including a structurally im- 
portant disulfide bond between positions 64 and 74. Bovine 
G-CSF is a good model for human G-CSF because the 
sequences are the same length and 142 of 174 amino acids 
are identical (82%). The residues that differ in the bovine 
sequence were replaced with the human residues for those 
positions, and the conformations of the replaced side chains
were optimized using PDA. Most of the replaced residues
were solvent exposed, thereby introducing little strain into
the structure and allowing typical PDA parameters to be
used for conformation optimization. One substitution, however,
was at a buried site, G167V, and clashed sterically
with a nearby disulfide bond. To accommodate the larger
Val, the side-chain conformation at this position was optimized
using a less restrictive van der Waals scale factor (0.6
instead of 0.9). The entire structure was then briefly minimized
to relax the strain. The final structure that served as
the template for all the designs is shown in Figure 1.

Core designs 

Unlike many experimental sequence screening methods,
PDA allows control over which residues are allowed to 




[Figure 1 color key: residues identical in bovine and human sequences (82%);
residues that differ, replaced by human residues; fragments missing in human
crystal structure; side chains not visible in crystal structure, replaced by
wild-type side chains using PDA; key residues believed to be important for
activity.]

Fig. 1. Template structure of hG-CSF used for Protein Design Automation
(PDA) designs. The human sequence was homology modeled onto the
bovine crystal structure (PDB record 1bgc). The residues that differ in the
bovine sequence or were not present in the bovine crystal structure were
replaced with the residues from the human sequence. The conformations of
the replaced side chains were optimized using PDA (the larger Val at
position 167 was optimized using a less restrictive van der Waals scale
factor), and the entire structure was energy minimized for 50 steps.



change. Core residues were selected because optimization 
of these positions can improve stability yet minimize 
changes to the molecular surface, thus limiting the designed 
protein's potential for antigenicity. Ala scanning studies of 
G-CSF indicate one or two binding sites on the protein 
surface that are probably responsible for granulopoietic activity
(Reidhaar-Olson et al. 1996; Young et al. 1997) (Fig.
1). Although recent crystallographic studies of G-CSF com- 
plexed to its receptor show only one binding site in a novel 
2:2 complex (Horan et al. 1996; Aritomi et al. 1999), both 
sites were avoided in the core designs to ensure preservation 
of function. 

Two PDA design calculations were run: a deep core design
that included residues deeply buried in the interior of
the protein and an expanded core design (exp_core) that
also included less buried peripheral core residues. The deep
core design had 26 core positions that were allowed to vary
(shown yellow and gold in Fig. 2), whereas exp_core had 34
(shown yellow and turquoise in Fig. 2). Only hydrophobic
amino acids were considered at the variable core positions.
These included Ala, Val, Ile, Leu, Phe, Tyr, and Trp. Gly
was also allowed for the variable positions that had Gly in
the bovine wild-type structure (positions 28, 149, 150, and
167). Met and Pro were not allowed.
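The size of the sequence space implied by these design choices can be bounded with a short calculation. The sketch assumes seven hydrophobic choices at each variable position and, as an upper bound, an eighth choice (Gly) at up to four positions; which of the four Gly positions fall inside each design is not stated here, so both bounds are reported.

```python
import math

N_HYDROPHOBIC = 7   # Ala, Val, Ile, Leu, Phe, Tyr, Trp
MAX_GLY_SITES = 4   # positions 28, 149, 150, and 167 may also keep Gly

def log10_core_space(n_positions: int, n_gly_sites: int = 0) -> float:
    """log10 of the number of core sequences, with one extra choice (Gly)
    at n_gly_sites of the variable positions."""
    regular = n_positions - n_gly_sites
    return regular * math.log10(N_HYDROPHOBIC) + n_gly_sites * math.log10(N_HYDROPHOBIC + 1)

for name, n in (("deep core", 26), ("exp_core", 34)):
    low = log10_core_space(n)
    high = log10_core_space(n, MAX_GLY_SITES)
    print(f"{name}: between 10^{low:.1f} and 10^{high:.1f} sequences")
```

Even the smaller design is far beyond what could be screened experimentally, which is the motivation for doing the screen in silico.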

Optimal sequences 

The optimal sequences selected by PDA are also shown in
Figure 2. The optimal sequence from the deep core design
had 10 mutations (named core10), and the optimal exp_core
sequence had 11 (named exp_core11); thus, 33%-38% of
the variable residues changed their identities. Eight of the
mutated positions changed to the same amino acid in both
designs. Changing the set of design positions can significantly
impact the amino acid selected at a given position.
For example, in the deep core design, Leu89 retains the
same amino-acid identity and conformation as wild type.
However, in the exp_core design, when Leu92 is also allowed
to vary, both positions (Leu89 and Leu92) mutate to
Phe, indicating a coupling between these two core residues.
The modeled structure of the sequence selected in the deep
core design (core10) is shown in Figure 3.

Native human G-CSF (met hG-CSF) and the optimal sequence
from each of the core designs were cloned, expressed
in Escherichia coli, and purified for experimental
studies.

Thermal stability

The far-ultraviolet (UV) circular dichroism (CD) spectra for
met hG-CSF and the designed proteins were nearly identical
to each other and to published spectra for met hG-CSF
(Reidhaar-Olson et al. 1996; Young et al. 1997), indicating
highly similar secondary structure and tertiary folds (data
Fig. 2. Sequences of hG-CSF analogs. Native human and bovine sequences are shown at the top. The fragments missing in the crystal
structure of the human sequence are shown boxed. Variable positions are colored. The deep core design had 26 variable positions,
exp_core had 34, and core167V had 25. The optimal sequence from each design is shown. Letters indicate core residues that mutated
relative to native hG-CSF; blanks indicate no change. Positions that changed to the same amino acid in all three core designs are
indicated in bold. Core2 and core8 sequences were not obtained from PDA calculations but were derived by reverting some of the
core10 mutations to wild type. Melting temperatures (Tms) obtained for the designed proteins are also shown.



not shown). Thermal denaturation was monitored at 222
nm, and the melting temperatures (Tms) were derived from
the derivative curve of the ellipticity at 222 nm versus
temperature (Fig. 4). Thermal denaturation of G-CSF and its
variants is irreversible; however, Tm can be used to quickly
assess the relative stability of different mutants. Stability
under storage conditions, which is more relevant clinically,
was evaluated with shelf-life studies (see below).

The Tm for met hG-CSF was 60°C, identical to that reported
in other studies (Kolvenbach et al. 1997). Core10
showed an increase in stability of 13°C, whereas the Tm of
exp_core11 was very similar to wild type (Fig. 2 and Fig. 4).
The increased stability seen with core10 may be attributable
to improved packing interactions and optimized hydrophobic
burial of side chains. Other possibilities include decreased
aggregation resulting from elimination of the free



cysteine at position 17. The Gly to Ala mutation at position 
28 caused a significant improvement in helical propensity 
that could also be the source of the improved stability.



Identifying critical mutations using derived sequences

To differentiate between these possibilities, two additional
sequences derived from the core10 mutant sequence were
made and their Tms measured. One of these (core8) was
identical to core10 except that two mutations distant from
the others were reverted to wild type (L103V and V110I).
These were the two positions that did not mutate in
exp_core11. The Tm of core8 was 70°C, similar to core10,
indicating that the mutations at 103 and 110 were not
responsible for core10's improved stability.

To determine the importance of the other mutations, another
sequence was made (core2) that contained only two of
the core10 mutations, G28A and C17A; all other residues
were identical to wild type (Fig. 2). The Tm of core2 was
5°C higher than wild type, indicating that improvements in
helical propensity and the elimination of a free cysteine are
important for heightened thermostability. The remainder of
the increase in Tm seen for core10 may be attributable to
improved packing interactions and increased hydrophobic
burial.

Fig. 3. Modeled structure of hG-CSF analog (core10) obtained from deep
core design. Twenty-six core residues were allowed to vary; computational
screening with PDA resulted in 10 mutations: C17L, G28A, L78F, Y85F,
L103V, V110I, F113L, V151I, V153I, and L168F. Legend: variable residues (26);
side chains of mutated residues (10); wild-type side chains of mutated residues.



Storage stability 

Increased shelf life is important for distribution and storage 
and is a desirable feature for G-CSF and other protein drugs. 
Because aggregation and chemical degradation are the predominant
mechanisms of inactivation of G-CSF (Herman et
al. 1996), shelf life was estimated by incubating the proteins
at elevated temperature and then using size-exclusion chromatography
to observe the disappearance of monomeric
protein. Chemical degradation was estimated using reverse
phase chromatography (data not shown). Core2 and core10
showed five- and ~10-fold improvements in storage stability,
respectively, at 50°C (Fig. 5). Rate constants were determined
by a first order exponential fit of the fraction monomer
remaining/time curves using KaleidaGraph (Synergy
Software).



Biological activity

Granulopoietic activity was determined in vitro by quantitating
cell proliferation as a function of protein concentration
in murine lymphoid cells transfected with the gene for
the human G-CSF receptor. The designed proteins were as 

Fig. 4. Thermal stability of hG-CSF analogs. Thermal stability was assessed
by monitoring the temperature dependence of the circular dichroism
spectral signal at 222 nm. Melting temperatures (Tms) were derived from
the derivative curve of the ellipticity at 222 nm versus temperature. Core10
and core2 showed increases in Tm of 13°C and 5°C, respectively, over
native met hG-CSF.

Fig. 5. Shelf life of hG-CSF analogs. Shelf life was estimated by incubating
the proteins at elevated temperature (50°C) and using size-exclusion
chromatography to observe disappearance of monomeric protein. Rate constants
were determined by a first order exponential fit of the fraction
monomer remaining/time curves. Core2 and core10 showed five- and 10-fold
improvements in storage stability, respectively, over met hG-CSF
controls.

Fig. 6. In vivo granulopoietic activity of hG-CSF analogs. Mice were
rendered neutropenic with a single intraperitoneal injection of 200 mg/kg
cyclophosphamide (CPA). Beginning 24 h later and for 4 consecutive days,
the mice were given a daily intravenous injection of 100 µg/kg of native
hG-CSF (filgrastim, Amgen), an hG-CSF analog, or saline. On day 5,
granulopoietic activity was determined by counting the number of white
blood cells and polymorphonuclear neutrophils (PMN). The designed analogs
(core8 and core10) were as effective as controls in eliciting a
granulopoietic response.

active as wild-type hG-CSF (data not shown). The designed 
analogs were also as effective as wild type in increasing 
white blood cell and polymorphonuclear neutrophil levels in 
the neutropenic mouse (Fig. 6). Neutropenia, characterized 
by an abnormally low level of neutrophils in the blood, was 
induced by injection of cyclophosphamide. Reversal of this 
effect by the designed analogs shows that granulopoietic
activity was also retained in vivo. 

Pharmacokinetics 

The pharmacokinetics of core10 and native hG-CSF (filgrastim,
Amgen) was studied in cynomolgus monkeys after
a single subcutaneous or intravenous injection of 5 µg/kg
and after daily subcutaneous injections of 5 µg/kg for 28 d.
Analysis of the serum concentration-time curves shows that
subcutaneous injection of the designed analog results in
greater systemic exposure (area under concentration-time
curve, AUC) than the same dose of wild-type hG-CSF (Fig.
7B). This was true after a single dose on day 1 (78.8 vs. 54.6
ng·h/mL, data not shown), as well as on day 28 (37.2 vs.
17.4 ng·h/mL). There were no measurable differences in
serum half-life. In the intravenous study, however, the
half-life of core10 was three-fold shorter (1 vs. 3 h), and the
AUC was significantly less (54.7 vs. 117.4 ng·h/mL), indicating
that core10 is cleared faster (Fig. 7A). Taken together,
these data indicate that the designed analog is absorbed
more quickly from the subcutaneous compartment
(absorption could not be measured directly given the small
number of data points at early times). Improved absorption
may be attributable to decreased aggregation or association
of the designed protein. The increased monomer lifetime
and decreased aggregation seen in our shelf-life studies and



the improved thermal stability of the native conformation
observed for core10 indicate a decrease in aggregation in
the subcutaneous compartment. This possibility is supported
by the fact that other protein therapeutics engineered
for reduced aggregation also show faster absorption rates.
For example, insulin Lispro and other rapid-acting insulin
analogs that were designed to decrease their tendency to
self-associate are absorbed faster than regular insulin after
subcutaneous injection (Howey et al. 1994; Home et al.
1999).

Fig. 7. Pharmacokinetics of hG-CSF analogs. Plasma concentrations of a
designed hG-CSF analog or wild-type hG-CSF (filgrastim, Amgen) were
determined after administration in cynomolgus monkeys. (A) Animals were
given a single intravenous injection of 5 µg/kg or (B) daily subcutaneous
injections of 5 µg/kg for 28 d. Noncompartmental analysis of the serum
concentration-time curves shows that subcutaneous injections of the core10
analog resulted in greater systemic exposure (area under concentration-time
curve, AUC) than the same dose of wild-type hG-CSF, whereas there
was no change in serum half-life (t1/2). In the intravenous study, the AUC
was significantly less and the t1/2 three-fold shorter, indicating that core10
was cleared faster.

Comparison to published G-CSF variants

In vitro and cassette mutagenesis studies have shown that
alterations of the N-terminal region of G-CSF can lead to
improved granulopoietic activity (Kuga et al. 1989; Okabe
et al. 1990). Point mutations at Cys17 have also been found
to affect shelf life; replacement with Ala led to an increase,
Ser had no effect, and large residues (Ile, Tyr, Arg) led to a
decrease (Ishikawa et al. 1992). In contrast, our core10
sequence, which has a large residue (Leu) at this position,
showed an improved shelf life. This may be explained by
the observation that in a Cys17Leu point mutant, Leu's side
chain would clash with the aromatic ring of the nearby Phe
at position 113. This steric clash does not occur in core10,
however, because the Phe at 113 is replaced by Leu and, in
compensation for this change, two nearby Leu's become
Phe's (at positions 78 and 168). Thus, multiple mutations
allow complementary repacking of the hydrophobic core in
the core10 mutant and may be responsible for its enhanced
stability and shelf life.

Significant improvements in thermal stability were also 
observed when the seven helical Gly residues in G-CSF 
were replaced with Ala to form point, double, and triple 
mutants (Bishop et al. 2001). Substitutions at positions 26, 
28, 149, and 150 were the most effective. The investigators 
attributed the stabilizing effect to the enhancement in α-helical
propensity associated with the Gly/Ala substitutions.
These data support our suggestion that the heightened thermal
stability seen with our mutants (which also contain a
Gly/Ala substitution at position 28) is at least in part
attributable to an improvement in helical propensity.

Probing the robustness of PDA with 
a homology modeled core position 

As pointed out previously, the homology modeling of human
G-CSF onto the bovine structure was straightforward
for the most part because the replaced residues were primarily
solvent exposed and no rearrangement of the backbone
was necessary. The change at one core position, however,
G167V, induced a steric clash and energy minimization of
the entire protein was used to relieve the strain. We decided
to assess the impact of this manipulation by doing an additional
design (core167V) in which the variable residues
were essentially the same as in the deep core design except
that position 167 was also allowed to vary. We found that
Val167 mutated to Ala (the other mutations were essentially
the same as for core10). To probe the plasticity of the core,
instead of using this PDA optimal sequence, which only had
two mutations in this region, we ran experiments on another
high-scoring sequence (core14_V167A) that had additional
mutations (14 total, including L157I, F160W, and L161F).
This sequence was chosen because it balanced an extensive
number of mutations with a relatively high design score.



Although it ranked 21st in the sequence energy list and was 
2 kcal/mole less favorable than the optimal sequence, it was 
still biologically active and as stable as wild type (Tm of
61°C) (Figs. 2, 4). This indicates that optimization with
PDA is fairly robust, and that the protein core can be quite 
plastic and can accommodate large changes without sacri- 
ficing stability or function. 



Conclusions 

PDA is a powerful ultrahigh throughput computational 
screening method. Its ability to screen up to 10^80 sequences
and allow multiple simultaneous mutations significantly increases
the likelihood of finding new and improved proteins.
In this study, PDA was used to develop improved
analogs for a therapeutically important protein, hG-CSF.
The novel proteins showed enhanced thermal stabilities and
shelf life while retaining biological activity. Analysis of the 
mutants and results obtained with derived sequences indi- 
cates that the heightened stability is attributable to improve- 
ments in helical propensity and the elimination of a free 
cysteine; improved core packing and optimized hydrophobic
burial of side chains may also be important. Pharmacokinetic
studies indicate that subcutaneous injection of the
most stable variant results in greater systemic exposure, 
probably attributable to improved absorption from the sub- 
cutaneous compartment. 

These results show that PDA can be successfully applied 
to proteins of therapeutic interest. They also illustrate the 
value of its precise control over the site and type of mutations,
allowing for the rational design of desired properties
such as improved stability and pharmacokinetics and the
elimination of undesirable ones such as toxicity and antigenicity.
These features are particularly important in the design
of therapeutic proteins. PDA thus has great potential as
a powerful in silico tool for therapeutic protein design. 

Materials and methods 

Template structure preparation 

The template structure for the designed proteins was produced by
homology modeling using the crystal structure of bovine G-CSF
(Brookhaven Protein Data Bank code 1bgc) as the starting point.
The program BIOGRAF (Molecular Simulations Inc., San Diego,
CA) was used to generate explicit hydrogens on the structure,
which was then minimized for 50 steps using the conjugate gradient
method and the Dreiding II force field (Mayo et al. 1990).
The residues that differ in the bovine sequence or were not present 
in the bovine crystal structure were replaced with the human resi- 
dues for those positions. The conformations of the replaced side 
chains were optimized using PDA (Dahiyat and Mayo 1997a,b), 
and the entire structure was minimized again for 50 steps. This 
minimized structure was used as the template for all the designs. 

Protein design

Analogs of hG-CSF were designed by simultaneously optimizing
residues in the buried core of the protein using PDA. The computational
details, residue classification, potential functions, and parameters
used for van der Waals interactions, solvation, and hydrogen
bonding are described in previous work (Dahiyat and Mayo
1996, 1997a). An expanded version of the backbone-dependent
rotamer library of Dunbrack and Karplus (Dunbrack and Karplus
1993) was used in all the calculations. The global optimum sequence
from each design was selected for characterization and
experimental testing, except for core167V, in which the 21st-ranked
sequence was used. Calculations were generally performed overnight
using 16 processors of an SGI Origin 2000 with 32 R10000
processors running at 195 MHz. The length of the runs varied from
1 to several hours of CPU time.



Cloning and expression 

A gene for met hG-CSF was synthesized from partially overlapping
oligonucleotides (~100 bases) that were extended and PCR
amplified. Codon usage was optimized for E. coli and several
restriction sites were incorporated to ease future cloning. These
partial genes were cloned into a vector and transformed into E. coli
for sequencing. Several of these gene fragments were then cloned
into adjacent positions in an expression vector (pET17 or pET21)
to form the full-length gene for met hG-CSF (528 bases) and
transformed into E. coli for expression. Protein was expressed in E.
coli in insoluble inclusion bodies and its identity was confirmed by
immunoblot of SDS-PAGE using a commercial mAb against
hG-CSF.



Refolding, purification, and storage 

The protein inclusion bodies were solubilized in detergent and
refolded in the presence of CuSO4 to promote formation of native
disulfide bonds (Lu et al. 1992). A size-exclusion column (10
mm × 300 mm loaded with Superdex prep 75 resin purchased from
Pharmacia) was loaded with protein and eluted at a flow rate of 0.8
mL/min using the column buffer (100 mM Na2SO4, 50 mM Tris,
pH 7.5). The peaks were monitored at dual wavelengths of 214 nm
and 280 nm. Albumin, carbonic anhydrase, cytochrome C, and
aprotinin were used to calibrate the molecular size of proteins
versus elution time. The monomeric peak that elutes around the
expected elution time for each protein was collected and the buffer
was exchanged into 10 mM NaOAc at pH 4 for biophysical
characterization. For long-term storage, a buffer of 5% sorbitol,
0.004% Tween 80, and 10 mM NaOAc at pH 4 was used. A pH of
4 was chosen for these buffers to be consistent with the commercial
formulation of hG-CSF (Amgen), which was used as a control.
The proteins were >98% pure as judged by reversed phase high
performance liquid chromatography (HPLC) on a C4 column (3.9
mm × 150 mm) with a linear acetonitrile-water gradient containing
0.1% TFE. The identities of all proteins were confirmed by comparing
the molecular mass measured by mass spectrometry with
corresponding molecular mass calculated using the protein
sequences.

Spectroscopic characterization 

Protein samples were 50 µM in 50 mM sodium phosphate at pH
5.5. Concentrations were determined using UV spectrophotometry.
Protein structure was assessed by CD. CD spectra were measured 



on an Aviv 202DS spectrometer equipped with a Peltier temperature
control unit using a 1-mm path length cell. Thermal stability
was assessed by monitoring the temperature dependence of the CD
signal at 222 nm (Kolvenbach et al. 1997). A buffer of 10 mM
NaOAc was used at pH 4.6 and data were collected every 2.5°C
with an averaging time of 5 sec and an equilibration time of 3 min.
Thermal denaturation curves were smoothed using KaleidaGraph.
The melting temperature (Tm) of each protein was derived from
the derivative curve of the ellipticity at 222 nm versus temperature.
The Tm values were reproducible to within 2°C for the same
protein at the concentrations used.
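
The derivative-based Tm estimate described above can be illustrated with a short sketch. It is not the authors' KaleidaGraph procedure: the function name `melting_temperature`, the central-difference derivative, and the synthetic two-state curve are all assumptions made for illustration, and the smoothing step mentioned in the text is omitted.

```python
import math

def melting_temperature(temps, signal):
    """Return the temperature where |d(signal)/dT| is largest.

    Uses a central difference at interior points; assumes temps are
    sorted and strictly increasing (hypothetical helper, not from the paper).
    """
    if len(temps) != len(signal) or len(temps) < 3:
        raise ValueError("need at least three matched (T, signal) points")
    best_temp, best_slope = None, -1.0
    for i in range(1, len(temps) - 1):
        slope = (signal[i + 1] - signal[i - 1]) / (temps[i + 1] - temps[i - 1])
        if abs(slope) > best_slope:
            best_slope, best_temp = abs(slope), temps[i]
    return best_temp

# Synthetic sigmoidal unfolding curve sampled every 2.5 degrees C
# (illustrative values only, not measured data).
temps = [25.0 + 2.5 * i for i in range(25)]          # 25.0 ... 85.0
true_tm = 60.0
signal = [-20.0 + 15.0 / (1.0 + math.exp(-(t - true_tm) / 2.0)) for t in temps]

tm = melting_temperature(temps, signal)
print(f"estimated Tm = {tm:.1f} C")
```

Because the estimate is restricted to sampled temperatures, its resolution is limited by the sampling interval (2.5°C in the protocol above) unless the derivative curve is interpolated.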



Storage stability 

The storage stability of the designed proteins was assessed by
incubation at both 37°C and 50°C under solution conditions identical
to that used in the commercial formulation of hG-CSF (filgrastim,
Amgen). Because aggregation and chemical degradation
are the predominant mechanisms of inactivation of G-CSF (Herman
et al. 1996), accelerated degradation was followed by observing
the disappearance of monomeric protein with both size-exclusion
and reverse-phase chromatography. Rate constants for shelf-life
estimation were determined by a first-order exponential fit of
the fraction monomer remaining/time curves using KaleidaGraph
(Synergy Software).
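
As a rough illustration of the first-order fit, the sketch below estimates a rate constant k from fraction-monomer-remaining data, assuming f(t) = exp(-k t) with f(0) = 1. It uses a log-linear least-squares fit through the origin rather than the nonlinear fit a package such as KaleidaGraph would perform; the function name and all data values are hypothetical.

```python
import math

def first_order_rate(times, fraction_monomer):
    """Least-squares estimate of k in f(t) = exp(-k * t).

    Fits ln(f) = -k * t through the origin (assumes f(0) = 1 and all
    fractions > 0). Hypothetical helper for illustration only.
    """
    if any(f <= 0.0 for f in fraction_monomer):
        raise ValueError("fractions must be positive to take logarithms")
    denom = sum(t * t for t in times)
    if denom == 0.0:
        raise ValueError("need at least one nonzero time point")
    numer = sum(t * math.log(f) for t, f in zip(times, fraction_monomer))
    return -numer / denom

# Made-up decay curves (time in days), not data from the study.
times = [0.0, 1.0, 2.0, 4.0, 8.0]
reference = [math.exp(-0.5 * t) for t in times]    # faster monomer loss
variant = [math.exp(-0.05 * t) for t in times]     # slower monomer loss

k_ref = first_order_rate(times, reference)
k_var = first_order_rate(times, variant)
fold_improvement = k_ref / k_var                   # ratio of rate constants
print(f"k_ref={k_ref:.3f}, k_var={k_var:.3f}, fold={fold_improvement:.1f}")
```

Under this model a fold improvement in storage stability corresponds to the ratio of the two rate constants, which is also the ratio of the monomer half-lives.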



Cell proliferation assay 

Granulopoietic activity was measured by quantifying cell proliferation
as a function of protein concentration using Ba/F3 (murine
lymphoid) cells stably transfected with the gene encoding the human
Class I G-CSF receptor (Avalos et al. 1995). Cell proliferation
was detected by 5-bromo-2'-deoxyuridine (BrdU) incorporation
quantified by a BrdU-specific ELISA kit (Boehringer
Mannheim).



In vivo biological activity 

Granulopoietic activity was determined in the neutropenic mouse 
(Hattori et al. 1990). C57BL/6 mice were rendered neutropenic
with a single intraperitoneal injection of 200 mg/kg cyclophosphamide
(CPA). Beginning 24 h later and for 4 consecutive days, the
mice were given a daily intravenous injection of 100 µg/kg of an
hG-CSF analog, met hG-CSF produced in our laboratory, clinically
available hG-CSF (filgrastim, Amgen), or saline. On day 5,
6 h after the final dose, the animals were killed, blood samples
were collected, and granulopoietic activity was determined by 
counting the number of white blood cells and polymorphonuclear 
neutrophils. 



Pharmacokinetics 

Plasma concentrations of a designed hG-CSF analog or wild-type
hG-CSF (filgrastim, Amgen) were determined following administration
in cynomolgus monkeys. Animals were given a single intravenous
injection of 5 µg/kg or daily subcutaneous injections of
5 µg/kg for 28 d. In the intravenous study, blood samples were
collected at 0 (predose), 5, 15, and 30 min and 1, 2, 4, 6, 8, 12, and
24 h postdosing. In the subcutaneous studies, blood samples were
collected at 0 (predose), 1, 2, 4, 6, 8, 12, and 24 h postdosing on
day 1 and day 28. All samples were immediately placed on wet ice
and centrifuged at 2-8°C. The resultant plasma was then frozen and
stored (-70°C). Plasma concentrations were determined using an
enzyme-linked immunosorbent assay (Quantikine human G-CSF
ELISA, R&D Systems, Minneapolis, MN), performed per manufacturer's
instructions except that samples were diluted in PBS, 5%
nonfat dry milk, and 0.05% Tween 20, and the incubation was
extended to overnight at 4°C. Plasma concentrations of the designed
hG-CSF analog and filgrastim were estimated from their
corresponding standard curves. Pharmacokinetic parameters were
calculated by noncompartmental analysis. The terminal slope (λz)
was estimated by linear regression through the last time points of
the log concentration versus time curves and used to calculate the
terminal half-life (t1/2). The area under the curve from time of
dosing through the last time point (AUC0-t) was calculated by the
linear trapezoid method.
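
The two noncompartmental quantities named above can be sketched as follows. This is a minimal illustration, not the software used in the study: the function names, the choice of natural logarithms, the default of three terminal points, and the concentration values are all assumptions.

```python
import math

def auc_linear_trapezoid(times, conc):
    """Area under the concentration-time curve by the linear trapezoid rule."""
    pairs = list(zip(times, conc))
    return sum((t2 - t1) * (c1 + c2) / 2.0
               for (t1, c1), (t2, c2) in zip(pairs, pairs[1:]))

def terminal_half_life(times, conc, n_last=3):
    """Half-life from a linear regression of ln(conc) on time.

    Uses the last n_last points (an assumed default); lambda_z is the
    negative of the fitted slope and t1/2 = ln(2) / lambda_z.
    """
    if n_last < 2 or n_last > len(times):
        raise ValueError("n_last must be between 2 and the number of points")
    ts = times[-n_last:]
    ys = [math.log(c) for c in conc[-n_last:]]
    t_mean = sum(ts) / n_last
    y_mean = sum(ys) / n_last
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
             / sum((t - t_mean) ** 2 for t in ts))
    lambda_z = -slope
    return math.log(2.0) / lambda_z

# Made-up mono-exponential profile (h, ng/mL), not study data.
times = [0.0, 1.0, 2.0, 4.0, 8.0]
conc = [100.0 * math.exp(-0.5 * t) for t in times]

half_life = terminal_half_life(times, conc)
auc_small = auc_linear_trapezoid([0.0, 1.0, 2.0], [10.0, 6.0, 2.0])
print(f"t1/2 = {half_life:.3f} h, AUC (small example) = {auc_small} ng*h/mL")
```

The half-life does not depend on whether natural or base-10 logarithms are used, provided the conversion from slope to λz is done consistently.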

Proteins from Scratch

William F. DeGrado



Not long ago, it seemed inconceivable that
proteins could be designed from scratch. Because
each protein sequence has an astronomical
number of potential conformations,
it appeared that only an experimentalist with
the evolutionary life span of Mother Nature
could design a sequence capable of folding
into a single, well-defined three-dimensional
structure. But now, on page 82 of this issue,
Dahiyat and Mayo (1) describe
a new approach that makes de
novo protein design as easy as
running a computer program.
Well, almost.

The intellectual roots of this new work go back to the early
1980s, when protein engineers first thought about designing
proteins (2). At that point, the prediction of a protein's
three-dimensional structure from its sequence alone seemed a
difficult proposition. However, they opined that the inverse
problem (designing an amino acid sequence capable of assuming a
desired three-dimensional structure) would be a more tractable
problem, because one could "over-engineer" the system to favor
the desired folding pattern. Thus, the problem of de novo protein
design reduced to two steps: selecting a desired tertiary
structure and finding a sequence that would stabilize this fold.
Dahiyat and Mayo have now mastered the second step with
spectacular success. They have distilled the rules, insights, and
paradigms gleaned from two decades of experiments (3) into a
single computational algorithm that predicts an optimal sequence
for a given fold. Further, when put to the test, the algorithm
actually predicted a sequence that folded into the desired
three-dimensional structure. Thus, the rules of protein folding
and computational methods for de novo design may now be
sufficiently defined to allow the engineering of a variety of
proteins.

Dahiyat and Mayo's program divides the 
interactions that stabilize protein structures 
into three categories: interactions of side
chains that are exposed to solvent, of side
chains buried in the protein interior, and of
parts of the protein that occupy more interfacial
positions. Exposed residues contribute to
stability, primarily through conformational
preferences and weakly attractive, solvent-exposed
polar interactions (4). The burial of
hydrophobic residues in the well-packed interior
of a protein provides an even more
powerful driving force for folding. The side
chains in the interior of a protein adopt
unique conformations, the prediction of
which is a large combinatorial problem.

One important simplifying assumption
arose from the early work of Janin et al. (5),
who showed that each individual side chain
can adopt a limited number of low-energy
conformations (named rotamers), reducing
the number of probable conformers available
to a protein. This work was subsequently extended
to the design of proteins containing
only the most favorable rotamers (6). Although
the side chains in natural proteins
deviate from ideality in a few cases (complicating
the prediction of the structures of
natural proteins), these deviations need not
be considered in the design of idealized proteins.
Thus, various algorithms have been
developed to examine all possible hydrophobic
residues in all possible rotameric states, to
find combinations that efficiently fill the interior
of a protein. A complementary approach
uses genetic methods to exhaustively
search for sequences capable of filling a protein
core (7), and this work has been adapted
for the de novo design of proteins (8).

Interfacial residues are also quite important
for protein stability (9, 10). They
are often amphiphilic (for example, Lys,
Arg, and Tyr) and their apolar atoms can
cap the hydrophobic core, while their polar
groups engage in electrostatic and
hydrogen-bonded interactions.

Until recently, protein designers have frequently concentrated on
quantifying the energetics associated with just one of these three
types of interactions (3). However, de novo design is best
approached by simultaneously considering all of the side chains in
the protein (unfortunately, a very high-order combinatorial
problem). For instance, the volume available to the interior side
chains depends on the nature and conformation of the residues at
the interfacial positions and vice versa. Dahiyat and Mayo assumed
that each of these three features had been adequately quantitated
to provide a useful empirical energy function for protein design.
Their program combines a number of features taken from earlier
potential functions and includes a penalty for exposing hydrophobic
groups to solvent. Another essential innovation included in
their program is an implementation of the Dead-End Elimination
theorem, to efficiently search through sequence and side
chain rotamer space.
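
To make the pruning idea concrete, the sketch below applies one pass of the original dead-end elimination criterion to a toy problem: a rotamer r at position i is discarded if its best-case energy (self term plus the minimum pair term at every other position) is still higher than the worst-case energy of some competing rotamer t at the same position. This is an illustrative sketch only, not Dahiyat and Mayo's implementation; the function names and every energy value are invented.

```python
from itertools import product

def pair_energy(pair, i, r, j, s):
    """Symmetric lookup of the pair term for rotamer r at i and s at j."""
    return pair[(i, j)][r][s] if i < j else pair[(j, i)][s][r]

def dead_end_eliminate(self_e, pair):
    """One pass of the original DEE criterion; returns surviving rotamers."""
    n = len(self_e)
    survivors = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        rotamers = range(len(self_e[i]))
        # Best-case and worst-case energy contribution of each rotamer at i.
        best = [self_e[i][r] + sum(min(pair_energy(pair, i, r, j, s)
                                       for s in range(len(self_e[j])))
                                   for j in others) for r in rotamers]
        worst = [self_e[i][r] + sum(max(pair_energy(pair, i, r, j, s)
                                        for s in range(len(self_e[j])))
                                    for j in others) for r in rotamers]
        survivors.append([r for r in rotamers
                          if not any(best[r] > worst[t]
                                     for t in rotamers if t != r)])
    return survivors

def brute_force_minimum(self_e, pair):
    """Exhaustive search for the lowest-energy assignment (tiny cases only)."""
    n = len(self_e)
    def total(assign):
        e = sum(self_e[i][assign[i]] for i in range(n))
        e += sum(pair_energy(pair, i, assign[i], j, assign[j])
                 for i in range(n) for j in range(i + 1, n))
        return e
    return min(product(*(range(len(p)) for p in self_e)), key=total)

# Toy energies (arbitrary units): three positions with 2, 3, and 2 rotamers.
self_e = [[0.0, 10.0], [0.0, 0.5, 8.0], [1.0, 0.0]]
pair = {
    (0, 1): [[-0.5, 0.2, 0.0], [0.3, -0.2, 0.1]],
    (0, 2): [[0.1, -0.3], [0.0, 0.4]],
    (1, 2): [[0.2, -0.1], [-0.4, 0.3], [0.0, 0.0]],
}

survivors = dead_end_eliminate(self_e, pair)
best_assignment = brute_force_minimum(self_e, pair)
print(survivors, best_assignment)
```

The brute-force search is included only as a check that no rotamer of the global minimum is eliminated; in a real design problem the combinatorial space is far too large for exhaustive enumeration, which is the reason such pruning is needed.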

Dahiyat and Mayo's target fold is a zinc finger, a motif with
a well-established history in protein structure prediction and
design. In an early, prescient paper, Berg correctly inferred that
this His2Cys2 Zn-binding motif must feature a β-βα fold that would
position the ligating groups in a tetrahedral array around the
bound Zn(II) (11). Favorable metal ion-ligand interactions together
with a small apolar core help stabilize the three-dimensional
structure of this compact fold. More recently, Imperiali and
co-workers have designed a peptide that folded into this motif,
even in the absence of metal ions (12). The design included a
D-amino acid to stabilize a type II' turn, and a large, rigid
tricyclic side chain that may help consolidate the hydrophobic
core. This work was particularly exciting because, before their
studies, it was not expected that sequences as short as 25 residues
in length could fold into stable tertiary structures.

Better than the real thing. The natural zinc finger protein Zif268 (left) is
stabilized in part by a core of hydrophobic (green) side chains and
metal-chelating side chains (red). In the designed protein FSD-1 (right), the
Zif268 core is retained but the metal-chelating His residues and one of the
Cys residues of Zif268 are converted to hydrophobic Phe and Ala residues,
thereby extending the hydrophobic core. The fourth metal ligand
Cys is converted to a Lys residue. The apolar portion of this interfacial
residue shields the hydrophobic core, whereas its ammonium group is exposed
to solvent. The helix is also stabilized by an N-capping interaction
(15), which presumably also stabilizes the structure.

Now, Dahiyat and Mayo take these studies
one step further through the design of a sequence
composed of only natural amino acids
that adopts the zinc finger motif. As input to
their program, they introduced the coordinates
of the backbone atoms from the crystal
structure of the second domain of the zinc
finger protein Zif268. The program then
evaluated a total of 10" possible side chain-rotamer
combinations to find a sequence capable
of stabilizing this fold without a bound
metal ion. The resulting protein sequence
shares a small hydrophobic core with its predecessor
from Zif268. However, in the newly
designed protein FSD-1 the core is enlarged
through the addition of hydrophobic residues
that fill the space vacated by the removal
of the metal-binding site (see the figure).
This increase in the size of the hydrophobic
core together with the enhancements
in the propensity for forming the appropriate
secondary structure provide an adequate
driving force for folding. The designed
miniprotein actually folds into the desired
structure as assessed by nuclear magnetic
resonance spectroscopy, and the observed
structure closely resembles the three-dimensional
structure of Zif268.

Because of its small size, the protein is
marginally stable. A van't Hoff analysis of
the thermal unfolding curve gives a change
in the enthalpy (ΔHvH) of approximately
-10 kcal/mol, and indicates that the protein
is about 90 to 95% folded at low temperatures
(13). The small value of ΔHvH and the
lack of strong cooperativity in the unfolding
transition are expected for a native-like protein
of this very small size (14). Thus, FSD-1
is the smallest protein known to be capable
of folding into a unique structure without the
thermodynamic assistance of disulfides,
metal ions, or other subunits. This important
accomplishment illustrates the impressive
ability of Dahiyat and Mayo's program to
design highly optimized sequences.

This new achievement caps a banner year
for de novo protein design. Earlier, Regan (15)
answered the challenge of changing a protein's
tertiary structure by altering no more than 50%
of its sequence. And although Dahiyat and
Mayo have demonstrated that the stabilizing
metal-binding site is not necessary in their system,
Caradonna, Hellinga, and co-workers
(16) have made impressive progress in automating
the introduction of functional metal-binding
sites into the three-dimensional structures
of natural proteins. Further, other workers
(17) have used less automated approaches to
successfully introduce functionally and spectroscopically
interesting metal-binding sites
into de novo designed proteins.



To date, the most computationally inten-
sive protein design problems have been the
redesign of natural proteins of known three-
dimensional structure. But the new automated
approaches open the door to the de novo design
of structures with entirely novel backbone con-
formations. It will be interesting to see if
Dahiyat and Mayo's approach of designing an
optimal sequence for a given fold is sufficient,
or if it will be necessary also to destabilize alter-
nate possible folds. Indeed, when using an ear-
lier version of their algorithm to repack the
interior of the coiled coil from GCN4, they had
to retain the identity of a buried Asn residue
from the wild-type protein. Although the in-
clusion of this Asn actually destabilized the
desired fold, it was nevertheless essential to
avoid the formation of alternate, unwanted
conformers (18). The ability to ask such fo-
cused questions will reveal much about how
natural proteins adopt their folded conforma-
tions while simultaneously allowing the design
of entirely new polymers for applications rang-
ing from catalysis to pharmaceuticals.
Combinatorial protein design

Jeffery G Saven 

Combinatorial protein libraries permit the examination of a wide 
range of sequences. Such methods are being used for de novo 
design and to investigate the determinants of protein folding. 
The exponentially large number of possible sequences, however, 
necessitates restrictions on the diversity of sequences in a 
combinatorial library. Recently progress has been made in 
developing theoretical tools to bias and characterize the 
ensemble of sequences that fold into a given structure - tools 
that can be applied to the design and interpretation of 
combinatorial experiments. 
Introduction

The discovery and design of novel proteins can lead to 
new, potentially practical proteins and can also enhance 
our understanding of protein biochemistry. Designing 
well-structured, soluble proteins is difficult, however, 
because of their complexity. Such proteins are large (tens 
to hundreds of amino acid residues) and have many variables 
that specify the folded state, including sequence, backbone 
topology and sidechain conformation. Design involves 
identifying those sequences that fold into a given structure 
from a huge ensemble of possible sequences. This search 
is aided, in part, by the large degree of consistency seen in 
folded proteins. On average, a folded structure is well 
packed, hydrophobic residues are sequestered from solvent 
and most potential hydrogen bond interactions are satisfied. 
This consistency, however, is often complex, may have 
little simplifying symmetry and involves predominantly 
noncovalent interactions. Such interactions are some of the 
most difficult to accurately quantify. As such, estimating 
the free energies associated with mutation or structural 
ordering remains a subtle area of computational research. 
Nonetheless, many molecular potentials do contain a 'best
parameterization' of many of the interatomic interactions
and forces that we know are important for stabilizing
proteins. In some cases, such potentials have been used with
striking success in protein design [1**]. Given that these
potentials are necessarily approximate, however, one
promising approach is to use the partial information con-
tained in these functions in a probabilistic manner. A
probabilistic or statistical approach is also appropriate for 
characterizing the full variability of sequences that fold 
to a common structure, because there are likely to be an 
enormous number of such sequences. Such statistical 
methods can be applied in 'shotgun' approaches to de novo
protein design. Combinatorial experiments create and assay
many sequences in order to overcome shortcomings in
our understanding of folding or other molecular properties.
Even though combinatorial methods can address large 
numbers of sequences (10^4-10^12), these numbers are still
infinitesimal in comparison to the numbers of possible 
sequences (e.g. 20^100 ≈ 10^130 for a 100-residue protein). Thus,
methods for winnowing and focusing sequence space are 
a vital component of combinatorial protein design. Herein, 
I briefly discuss combinatorial methods for full sequence 
design. I also review recent theoretical developments 
in characterizing sequence ensembles — developments 
that can be applied to the design and interpretation of 
combinatorial experiments. 
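
The mismatch between library size and sequence space can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function names are illustrative and do not come from the article. It works in log10 units so that the numbers stay representable.

```python
import math

def log10_sequence_count(n_residues: int, n_states: int = 20) -> float:
    """Return log10 of n_states ** n_residues, the size of sequence space."""
    return n_residues * math.log10(n_states)

def log10_coverage(log10_library_size: float, n_residues: int,
                   n_states: int = 20) -> float:
    """Return log10 of the fraction of sequence space a library can sample."""
    return log10_library_size - log10_sequence_count(n_residues, n_states)

# A 100-residue protein built from 20 amino acids.
print(f"log10(20^100) = {log10_sequence_count(100):.1f}")
# Fraction of that space reached by a library of 10^12 members.
print(f"log10(coverage) = {log10_coverage(12, 100):.1f}")
```

Even a library at the upper end of what the text describes reaches only a vanishing fraction of the possible sequences, which is why the winnowing methods discussed below matter.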

Directed protein design 

There has been much effort — and success — in developing 
computational methods for 'directed' protein design. By
'directed protein design', I mean the identification of a
sequence (or a small set of sequences) that is likely to 
fold into a predetermined backbone structure. Each such 
sequence can then be synthesized to confirm its folded 
structure and other molecular properties. Early efforts in 
design identified proteins with substantial order, but not 
necessarily well-defined tertiary structure [2]. Because an 
enormous number of sequences are possible even for 
small proteins (<50 residues), computational methods 
have dramatically accelerated successful design. Typically, 
such methods are implemented as an optimization 
process, whereby amino acid identity and sidechain 
conformation are varied in order to optimize a scoring 
function that quantifies sequence/structure compatibility. 
Exhaustive searching of all m^N possible sequences (where
m is the number of different amino acid types or 'states' 
per residue and N is the number of residues in a target 
protein structure) is feasible only if a small number of 
residues N are allowed to vary or if the number of amino 
acids m is greatly reduced. If, in the optimization process, the 
different sidechain conformations (rotamer states) of each 
amino acid are also considered (see [3]), the complexity of 
the search increases still further, because m, the number of 
possible 'states' per residue, increases by a factor of ten 
or more. Although complete enumeration is typically not
feasible, sequence space can be sampled in a directed
manner in order to find optimal (or nearly optimal)
sequences. Stochastic methods, such as genetic algorithms 
or simulated annealing, involve searching sequence space 
in a partially random fashion; on average, the search 
progressively moves toward better scoring (lower energy) 
sequences [4,5]. The partially random nature of the search
permits escape from local minima in the sequence/rotamer
landscape. Using a simplified model, the Takada and Tamura
groups have included information about unfolded structures
(negative design) in a stochastic search for a sequence with
a 'funneled conformational energy landscape' [6]. One
47-residue three-helix bundle protein so selected has
CD and NMR spectral features of folded proteins (W Jin, 
O Kambara, H Sasakawa, A Tamura, S Takada, personal 
communication). When applied to atomically detailed 
representations, the stochastic methods focus primarily 
on repacking the interior of a structure with hydrophobic 
residues [7] and have been applied to the wild-type 
structures of 434 Cro [8], ubiquitin [9], the B1 domain of
protein G [10*], the WW domain [1**] and helical bundles
[11,12]. Although, in many cases, these methods have
identified experimentally viable sequences [1**,13], sto-
chastic search methods need not identify global optima [14*].
For potentials comprising only site and pair interactions,
elimination methods such as 'dead end elimination' can find
the global optimum [14*,15-17]. Such methods successively
remove individual amino acid rotamer states that cannot be
part of the global optimum until no further states can be
eliminated. The Mayo group applied such methods to
automate the full sequence design of both a 28-residue 
zinc finger mimic [18] and, after predetermining hydro- 
phobic and polar sites, a 51-residue homeodomain motif 
[19*]. The group has also redesigned portions of a variety 
of proteins [20-22]. Functional properties such as metal
binding or catalysis may also be included as elements of
the design process [23,24*]. The elements and algorithms
of directed protein design have been the subject of several 
recent reviews [1**,25,26*]. 
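
The elimination idea described above can be sketched on a toy problem. The Python example below is an illustration only, not the implementation of any group cited here: it applies the simplest published form of the dead end elimination criterion to randomly generated site and pair energies (the energies, problem size and function names are all assumptions), and compares the surviving states with the optimum found by exhaustive enumeration.

```python
import itertools
import random

def pair_term(pair, i, r, j, s):
    """Look up the pair energy of state r at i and state s at j, in either order."""
    return pair[(i, j)][r][s] if i < j else pair[(j, i)][s][r]

def total_energy(assign, site, pair):
    """Energy of one full assignment: site terms plus all pair terms."""
    n = len(assign)
    e = sum(site[i][assign[i]] for i in range(n))
    e += sum(pair[(i, j)][assign[i]][assign[j]]
             for i in range(n) for j in range(i + 1, n))
    return e

def dead_end_elimination(site, pair):
    """Return, per position, the states that survive repeated elimination."""
    n = len(site)
    alive = [set(range(len(site[i]))) for i in range(n)]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in list(alive[i]):
                # Best case for r: most favourable partner at every other position.
                best_r = site[i][r] + sum(
                    min(pair_term(pair, i, r, j, s) for s in alive[j])
                    for j in range(n) if j != i)
                for t in alive[i] - {r}:
                    # Worst case for t: least favourable partner everywhere.
                    worst_t = site[i][t] + sum(
                        max(pair_term(pair, i, t, j, s) for s in alive[j])
                        for j in range(n) if j != i)
                    if best_r > worst_t:
                        # r can never beat t, so r is not in the global optimum.
                        alive[i].discard(r)
                        changed = True
                        break
    return alive

def brute_force(site, pair):
    """Exhaustively enumerate all m^N assignments (feasible only for toy sizes)."""
    states = [range(len(s)) for s in site]
    return min(itertools.product(*states),
               key=lambda a: total_energy(a, site, pair))

random.seed(0)
n_pos, n_states = 4, 3  # hypothetical toy dimensions
site = [[random.uniform(-5, 5) for _ in range(n_states)] for _ in range(n_pos)]
pair = {(i, j): [[random.uniform(-1, 1) for _ in range(n_states)]
                 for _ in range(n_states)]
        for i in range(n_pos) for j in range(i + 1, n_pos)}

alive = dead_end_elimination(site, pair)
best = brute_force(site, pair)
print("surviving states per position:", [sorted(a) for a in alive])
print("brute-force optimum:", best)
```

Because a state is discarded only when its best case is strictly worse than another state's worst case, the optimum from exhaustive enumeration always remains among the survivors; the toy pair energies are kept small relative to the site energies so that some elimination is likely to occur.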

Despite some striking successes, computational methods 
for directed design have limitations with respect to both 
identifying folding sequences and characterizing the 
features of protein sequences that share a common structure. 
Stochastic methods, such as simulated annealing or genetic 
algorithms, can be applied to large proteins and permit 
many sites to be varied simultaneously, but the compu- 
tational times and resources required for such calculations 
are extensive, even for small proteins. When used as 
optimization methods, directed approaches will necessarily 
be sensitive to the energy or scoring function used. All 
energy functions in use in protein design, however, are 
necessarily approximate and uncertainties in the energy 
function may not merit the search for global optima. 
Furthermore, many naturally occurring proteins are not
optimized. In fact, most proteins are only marginally stable
(e.g. ΔG° < 10 kcal/mol for folding) [27]. In addition,
sequences that function, for example, those that bind another
molecule, need not be the global optimum with respect to
structural stability. Although stochastic methods can sample
such suboptimal sequences, in general an exponentially 
large number of them will be possible and such sampling 
will be time consuming. Thus, it is important to develop 
methods complementary to those used for directed protein 
design — methods that reveal the features of sequences 
that are likely to fold into a particular structure but that may 
not be structurally 'optimal'. Such computational methods
will have application to a new class of protein design studies, 
combinatorial experiments, in which large numbers of 
proteins may be simultaneously synthesized and screened. 



Combinatorial design 

Combinatorial design provides a complementary approach 
to directed design for understanding sequence/structure
compatibility and discovering novel sequences that fold 
into a specific structure. Combinatorial methods are 
powerful tools for cases in which we have an incomplete 
understanding of molecular properties. In protein combi- 
natorial design experiments, large numbers of sequences 
(libraries) are screened for evidence of folding into a 
predetermined structure. A combinatorial experiment has 
two key elements: creating a library with a desired degree 
of diversity and assaying for sequences with 'protein-like'
properties in terms of their structure or function. Depending
upon how the diversity is generated and assayed, experi-
ments of this type can explore a large number of sequences,
up to 10^12 [28*]. Certainly, such methods can be used to
discover 'hits', that is, a few sequences that are especially 
stable or that are unusually strong in their function or 
binding properties. In addition, combinatorial experiments 
readily generate a sequence ensemble. Thus, using combi- 
natorial experiments, we can potentially 'expand the protein 
sequence database' and the diversity of these additional
sequences will be at the control of the researcher. Features 
important to folding (and other properties) may be explored 
in a way that is decoupled from the evolutionary require- 
ments of nature's proteins. For example, these methods 
have been used to identify helical proteins [29-31], 
ubiquitin variants [32], self-assembled protein monolayers
[33], proteins with amyloid-like properties [33], metal-
binding peptides [34] and stable interhelical oligomers [35].
Several excellent reviews of combinatorial experiments have
appeared recently [36,37,38*,39**].

The complexity of combinatorial experiments implies that 
limitations must be placed on the sequences, because the 
number that can be created and screened (10^4-10^12) is
infinitesimal compared to the number possible (e.g.
20^100 ≈ 10^130).
Limitations on sequence properties are often guided by 
qualitative chemical considerations, but quantitative 
computational methods will be helpful in designing and 
interpreting combinatorial experiments. 

The Hecht group has probed the extent to which the 
patterning of hydrophobic and hydrophilic residues can 
successfully reduce complexity in combinatorial design. 
While maintaining the periodicity of α helices and β sheets
in particular tertiary structures, such patterning is applied 
in order to expose hydrophilic residues to solvent and to 
sequester hydrophobic residues in the interior of the 
protein. Early targets were helical proteins; a fiducial 
74-residue four-helix bundle was the template structure [40].
Such a structure has more than 20^74 ≈ 10^96 possible sequences.
After binary patterning, five hydrophobic and six 
hydrophilic amino acids were permitted at 24 interior 
and 36 exterior positions, respectively, thus reducing the
total number of possible sequences to 10^41. From a pro-
tein library consistent with this binary patterning, a set of
50 correctly expressed sequences was selected for further
study. Around half of the 50 sequences isolated are protein-
like in many respects [30], including their thermal
denaturation [41]. About half the isolated sequences
also bind heme [29] and many of these display carbon 
monoxide binding [42*] or peroxidase activity [43]. This is 
surprising given that such functions were not part of the 
design or selection of the sequences. In a second-generation
design, the group added six residues to each of the four 
helices of one of the most protein-like sequences. The 
additional residues were combinatorially patterned, as 
in the original experiment [39**]. For these 102-residue 
sequences, the free energies of folding are increased 
2-3-fold and the NMR data suggest well-determined 
structures. Using binary patterning of hydrophobicity 
consistent with an amphiphilic β sheet [44], the Hecht group
has also identified proteins that aggregate to form amyloid
fibrils [45] and crafted monomeric β proteins by introducing
a nonpolar-to-lysine mutation at the 'edge' strand of the
target β sheet [46**].

Despite the striking results from hydrophobic patterning, 
more detailed methods for library design are merited. Many 
of the hydrophobically patterned sequences that appear 
well structured are not sufficiently soluble for NMR 
structure determination [46**] and, as a result, little is 
known concerning their structures at the atomic scale. Not 
all of the α-helical sequences exhibit the sharp thermal
transition seen in natural proteins (usually associated with a
large ΔH of folding). Such sequences may not possess
well-packed interiors [41]. In natural proteins, the side-
chains of most interior residues are well determined, as
opposed to the variability that is obtained using hydrophobic
patterning alone and that is observed in many de novo
designed proteins [13,18]. A more fine-grained dictation of
the amino acid identities is probably necessary for obtaining
libraries that are rich in sequences with well-defined struc-
tures. Moreover, a more detailed specification of amino acid
identities yields fewer sequences than hydrophobic pattern-
ing alone and further reduces the complexity of the library.

Theories of combinatorial libraries 

Surveying the complete sequence landscape of proteins 
seems, at first glance, intractable to both experiment and 
computation. In addition to the enormous number of 
possible sequences, many examples exist in nature of dis- 
similar sequences folding to essentially the same structure. 
Hence, sequence properties are nontrivial and proteins 
sharing a common structure can be nonlocal in sequence
space. Nonetheless, computational methods permit us to
estimate the properties, particularly the amino acid proba-
bilities, of sequences consistent with a target structure.

Repeated use of directed search methods can estimate the 
properties of an ensemble of sequences. Desjarlais and
co-workers have used independent runs of their sequence
prediction algorithm across an ensemble of closely
related structures all consistent with a particular fold
(JR Desjarlais et al., personal communication). For each
structure, an optimal 'nucleating' sequence is identified
and subsequently the sequence/rotamer variability is 
explored throughout the structure. The method identifies 
effective reduced partition sums for each sequence/rotamer 
state and amino acid probabilities may be obtained at each 
residue position. The number of sequences decreases 
with stability, so the degree of complexity can be tuned by 
varying a cutoff in the effective free energies of the 
sequences. The method has been used to identify sequences 
consistent with the fold of a WW domain, a small β-sheet
protein [1**], some of which are currently being experi- 
mentally characterized. 

The amino acid frequencies can also be determined 
directly, using a statistical theory of combinatorial libraries 
[47,48**,49**]. Ideas from statistical mechanics are used to
address the number and composition of sequences that are 
consistent with a particular backbone structure. The theory 
addresses the whole space of available compositions, not 
just the small fraction that is accessible to experiment and 
to computational enumeration and sampling. The theory 
takes as input a target backbone structure and a scoring 
or energy function for quantifying sequence/structure 
compatibility. Global and local features can be prespecified
using constraints on the sequences. For example, such 
constraints can be used to determine the energy the 
sequences assume in the target structure, the patterning of 
amino acids and the number of each amino acid present 
(composition). The theory yields estimates of both the 
number of sequences consistent with these constraints and 
the amino acid probabilities at each residue position. 
These residue-specific probabilities are the most probable 
such set and are determined — as in statistical mechanics 
— by maximizing an effective entropy, whereby this 
maximization is subject to constraints. Just as in thermo- 
dynamics, the judicious use of constraints can be used to 
reduce the entropy or the number of possible sequences. 
Thus, these methods provide a systematic means to focus 
the library, winnowing numbers such as 10^130 to numbers
that are experimentally manageable, for example, 10^ The 
theory agrees well with exact results obtained with lattice 
models of proteins [47,48**]. This method has been
extended to realistic representations of proteins, in 
which the effects of sidechain packing are included in 
an atom-based manner [49**]. The calculated sequence
probabilities of the immunoglobulin light chain binding 
domain of protein L are in agreement with the frequencies 
observed in combinatorial phage display experiments [50,51]. 
These statistical methods have several advantages. They 
may be applied to much larger proteins (N > 100 residues)
and permit much larger sequence variation than many 
directed methods. They are sufficiently rapid that many 
backbone structures may be considered and those features 
that are robust with respect to minor structure modifications 
may be identified. Importantly, such methods provide 
perhaps the most natural input for a combinatorial exper- 
iment, the probabilities of the amino acids at each position 
among the sequences of a library. These amino acid 
probabilities can also be used to identify specific amino
acid sequences, which can then be synthesized; a consensus 
sequence comprising the most probable amino acid at each 
site can be selected or the probabilities can be used to bias 
a stochastic search for viable sequences (J Zou, JG Saven, 
unpublished data). 
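
The entropy-maximization step described above can be illustrated with a deliberately simplified sketch. The Python example below assumes site-only energies (no pair or sidechain-packing terms), a single constraint on the mean energy, and invented energy values; none of these come from the article. Under those assumptions the most probable site probabilities take a Boltzmann form, and the multiplier beta is found by bisection.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AFILMVW")  # assumed classification, for illustration only

def toy_site(buried: bool) -> dict:
    """Hypothetical site energies: -1 for the favoured residue class, +1 otherwise."""
    return {a: (-1.0 if (a in HYDROPHOBIC) == buried else 1.0)
            for a in AMINO_ACIDS}

def site_probabilities(energies, beta):
    """Maximum-entropy solution for one site: p(a) proportional to exp(-beta*e(a))."""
    e_min = min(energies.values())
    weights = {a: math.exp(-beta * (e - e_min)) for a, e in energies.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def mean_energy(site_energies, beta):
    """Ensemble-averaged energy summed over all sites."""
    total = 0.0
    for energies in site_energies:
        p = site_probabilities(energies, beta)
        total += sum(p[a] * energies[a] for a in energies)
    return total

def solve_beta(site_energies, target, lo=0.0, hi=50.0, iters=100):
    """Bisection on beta; the mean energy falls monotonically as beta grows."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_energy(site_energies, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def log10_sequence_count(site_energies, beta):
    """Effective number of sequences, exp(total entropy), reported as log10."""
    entropy = 0.0
    for energies in site_energies:
        p = site_probabilities(energies, beta)
        entropy -= sum(q * math.log(q) for q in p.values() if q > 0.0)
    return entropy / math.log(10)

sites = [toy_site(True), toy_site(True), toy_site(False)]
beta = solve_beta(sites, target=-2.0)
# Consensus sequence: the most probable amino acid at each site.
consensus = "".join(max(p, key=p.get)
                    for p in (site_probabilities(e, beta) for e in sites))
print("beta:", round(beta, 3))
print("log10(sequences), unconstrained:", round(log10_sequence_count(sites, 0.0), 2))
print("log10(sequences), constrained:  ", round(log10_sequence_count(sites, beta), 2))
print("consensus:", consensus)
```

Tightening the energy constraint raises beta and lowers the entropy, which is the sense in which constraints focus the library. A realistic calculation would include pair interactions and rotamer states, which this sketch omits.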

If the energy of the target state is one of the constraints, 
the statistical method reduces to an effective mean field 
theory. Mean field theories have seen extensive application 
in physical science and in biomolecular theory [52], and 
to protein evolution and natural sequence variability ([53]; 
H Kono, JG Saven, unpublished data). Voigt et al. [14*] have
compared mean field theories with directed search
methods for identifying ground state sequence/rotamer
combinations in protein design. They found that, although
often more rapid, mean field theories do not always identify
such ground states. Interestingly, Voigt et al. applied the mean
field theory to large proteins (subtilisin E and T4 lysozyme)
to determine local site entropies, s_i, where exp(s_i) quantifies
the effective number of amino acids allowed at residue i
in a structure [54**,55]. Sites with large values of s_i, those
most tolerant to mutation [56], are likely to support sub-
stitutions that improve stability or function when in vitro
evolution experiments are used to explore sequence 
space [37]. For such experiments, the mutation rate is low 
enough that multiple mutations of strongly interacting sites
are rare. Thus, mutations that improve 'fitness' are most
likely to accumulate at sites that are the most 'decoupled'
from other sites. Such mutations can potentially be targeted
for variation in an in vitro evolution experiment.
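
The site entropy used above has a direct numerical reading. The short Python sketch below assumes the standard definition s_i = -Σ p ln p over the amino acid probabilities at site i; the probability vectors are invented for illustration and are not taken from the cited work.

```python
import math

def site_entropy(probabilities):
    """Entropy s_i = -sum(p * ln p) over the amino acid probabilities at one site."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

def effective_states(probabilities):
    """exp(s_i): the effective number of amino acids allowed at the site."""
    return math.exp(site_entropy(probabilities))

# Hypothetical sites: fully tolerant, fully fixed, and split between two residues.
uniform = [1 / 20] * 20
fixed = [1.0] + [0.0] * 19
two_way = [0.5, 0.5] + [0.0] * 18

profile = {"tolerant": uniform, "fixed": fixed, "two-way": two_way}
# Rank sites from most to least tolerant of substitution.
ranked = sorted(profile, key=lambda name: site_entropy(profile[name]), reverse=True)
for name in ranked:
    print(name, round(effective_states(profile[name]), 2))
```

Sites at the top of such a ranking are the ones the text identifies as candidates for variation in an in vitro evolution experiment.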

Conclusions 

Much recent progress has been seen in the design and 
discovery of new proteins, and combinatorial approaches 
are acceleradng the pace. Such methods are most useful 
when our quanritarive understanding of important protein 
properties, such as stability and catalytic activity, is limited. 
Not only can combinatorial methods be used for discovery 
but also, more deeply, they can inform our understanding 
of protein properties by generaring and assaying whole 
ensembles of sequences. Tradirionally, advances in 
structural biology have come from examining the structures 
of naturally occurring proteins, but, with combinatorial 
experiments, an enormous diversity of sequences can be 
generated at the control of the researcher Detailed ques- 
tions can be addressed, such as the utility of hydrophobic 
patterning or of predetermining particular sites for amino 
acid variation. Theory and simulation will continue to aid 
the design and interpretation of combinatorial experiments. 
Such methods will also facilitate the exploration of what is 
possible with the amino acids: how diverse is the set of all 
possible sequences that fold to a particular structure and 
what structures not yet seen in nature can be crafted with 
the amino acids? Such methods will perhaps have an even 
more profound impact on designing nonbiological 
foldamers [57**], structures about which we have much less
empirical information than we do about biopolymers.
Quaternary structure: the four separate chains of hemoglobin assembled into an oligomeric protein



Protein function can only be understood in terms of pro-
tein structure, that is, the three-dimensional relation-
ships between a protein's component atoms. The struc-
tural descriptions of proteins, as well as those of other
polymeric materials, have been traditionally described
in terms of four levels of organization (Fig. 6-1):

1. A protein's primary structure (1° structure) is the
amino acid sequence of its polypeptide chain(s).

2. Secondary (2°) structure is the local spatial arrange-
ment of a polypeptide's backbone atoms without re-
gard to the conformations of its side chains.

3. Tertiary (3°) structure refers to the three-dimen-
sional structure of an entire polypeptide. The distinc-
tion between secondary and tertiary structures is, of
necessity, somewhat vague; in practice, the term sec-
ondary structure alludes to easily characterized struc-
tural entities such as helices.

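4. Many proteins are composed of two or more poly-
peptide chains, loosely referred to as subunits,
which associate through noncovalent interactions
and, in some cases, disulfide bonds. A protein's
quaternary (4°) structure refers to the spatial ar-
rangement of its subunits.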



In this, the first of four chapters on protein structure,
we discuss the 1° structures of proteins: how they are
elucidated and their biological and evolutionary signifi-
cance. We also survey methods of chemically synthesiz-
ing polypeptide chains. The 2°, 3°, and 4° structures of
proteins which, as we shall see, are a consequence of
their 1° structures, are treated in Chapter 7. In Chapter 8
we take up protein folding, dynamics, and structural
evolution, and in Chapter 9 we analyze hemoglobin as a
paradigm of protein structure and function.

1. PRIMARY STRUCTURE
DETERMINATION

The first determination of the complete amino acid
sequence of a protein, that of the bovine polypeptide
hormone insulin by Frederick Sanger in 1953, was of
enormous biochemical significance in that it definitively
established that proteins have unique covalent struc-
tures. Since that time, the amino acid sequences of sev-
eral thousand proteins have been elucidated. This ex-
tensive information has been of central importance in
the formulation of modern concepts of biochemistry for
several reasons: